FOREWORD


This issue of the Review is the sixth devoted to research technics in
education. The several chapters fall into three major categories as follows: 
(a) historical and status research (Chapters I-II); (b) experimentation 
and statistical analyses (Chapters III-VI); and (c) observational and 
test technics (Chapters VII-VIII). 

It is recognized that in many instances the results or findings of re- 
search studies may be inextricably tied in with the methodology, and that 
a review of research technics may therefore require some attention to 
research findings. The emphasis in this issue, however, is primarily upon 
research methodology rather than research results. 

In certain respects the content of this issue differs from that of the 
previous issues of the same title. First, the chapter on individual differences 
has been eliminated. This was done because much of the material that 
would have been a part of such a chapter was treated by Professor Ruth 
Strang in the third chapter of the December 1949 issue of the Review.
Moreover, certain aspects of the content covered in this chapter in previous 
issues have been incorporated in the chapters on observational procedures 
and tests. 

Second, the chapter on applications of experimental design and analysis 
has purposes different from most Review chapters. The purposes of this 
chapter are (a) to cite selected studies which are illustrative of certain 
fairly common misapplications of technics in educational research, and 
(b) to cite selected studies which might be useful as examples of both 
sound and unsound technics and good and poor reporting. Consequently, 
special emphasis is placed upon the critical evaluation of the technics 
employed in the studies reviewed. Also, while a comprehensive survey of 
experimental studies was undertaken by the authors, only a selected 
number are reviewed. A comprehensive review of recent developments in 
the field of experimental design and analysis is included in the chapter 
on developments in statistical theory. 

Third, a chapter on factor analysis has been added. Tho shorter treat- 
ments of factor methods and results have appeared in previous issues of 
the Review during the past 12 years, this is the first full-scale treatment 
attempted since the December 1939 issue. 

Fourth, the chapter on tests as research instruments has been limited 
largely to a consideration of general theoretical issues relating to test 
construction and use. This delimitation resulted largely from an attempt to 
avoid a repetition of the material covered and references cited in the 
February 1950 issue of the Review which was devoted entirely to edu- 
cational and psychological measurement. 

Important trends and summary evaluations for each area are presented 
by the respective chapter authors. In the general area of historical and 
status research, one interesting trend is in the direction of cooperative
effort. This is reflected not only by the development of regional catalog,
bibliographical, and library storage centers, but also by the increase in
comprehensive, cooperative, multisponsored survey projects. The general 
area of experimentation and statistical analysis is characterized by the 
rapid expansion and development of theoretical aspects and the marked 
lag between theory and application. This lag may be attributed in part 
to the fact that most of the theoretical advances have been the work 
of men in fields other than education and, consequently, are not readily 
adaptable to educational problems. Experimental studies in education 
continue to be characterized by poor reporting. It is safe to say that there 
has been virtually no improvement in this aspect of educational research 
in the last decade. The general area of observational and test technics is 
characterized by extensive interest in problems of social and individual 
adjustment and attention to the refinement of technics. 

Appreciation should be expressed to the men who prepared the individ- 
ual chapters for this issue of the Review. Educational research workers 
everywhere are immeasurably indebted to them. 


PAUL BLOMMERS, Chairman


CHAPTER I


Library Resources and Documentary Research 


CARTER V. GOOD 


This description of library resources, bibliographical technics, and
documentary research brings up to date the similar chapter by Good (53) 
in the December 1948 issue of the Review. The topics treated include: 
(a) library services, manuals, and general aids; (b) guides to books and 
periodical literature; (c) encyclopedias and dictionaries; (d) guides to 
theses and selected research projects; (e) serial and occasional bibliogra- 
phies and summaries; (f) institutional and biographical directories or 
handbooks; and (g) historiography and principles of historical writing. 


Library Services, Manuals, and General Aids 


Trends in library reference services indicated that the next step is in 
the direction of cooperation, as reflected in the formation and expansion of 
regional union catalogs and bibliographic centers, and organization of 
cooperative storage libraries for infrequently used research materials (18). 
It was even suggested that one approach to successful housing and use of 
the tremendous volume of library materials may be thru an “automatic 
electronic library,” to which the customer could refer by remote control, 
utilizing a recording of the document and a wire communication network 
to play the record by remote control over a loud speaker or automatic 
typewriter anywhere (111). 

Barton (13) prepared a revision of her brief guide to reference works. 
Brickman (20, 21) reviewed in two instalments the chief reference works 
in education over the past few years. The third edition of Alexander and 
Burke’s (1) standard guide to educational literature included a number 
of changes and improvements: appropriate library exercises after each 
chapter, placement of all material on general technics and sources in
Part I, and allocation to Part II of the details on finding specific
educational information.


Guides to Books and Periodical Literature 


The guides to books included the best volumes of a decade (39), the 
annual lists of educational books (28, 29, 30), the annual selections of 
outstanding educational books (27, 85, 96), and books in education con- 
sidered especially significant during a two-year period (114). Especially 
outstanding was the appearance of the Third Mental Measurements Year- 
book (24), with its 713 original critical reviews, and 851 reprinted 
reviews of 663 tests for the 1940-1947 period. While the section on test 
evaluation occupies the major portion of this volume, there are 200 pages
of reviews of books on measurements, and also indexes of periodicals, 
publishers, tests, books, and names. The forthcoming edition of Statistical 
Methodology Reviews (23) will cover the 1941-1950 decade, and will follow 
in sequence the Second Yearbook of Research and Statistical Methodology 
(22). Two volumes of the Annual Review of Psychology (102, 103) were 
published as periodic summaries of contemporary psychological literature, 
with two major purposes: to cover publications considered noteworthy,
and to emphasize an interpretative and evaluative approach to the
literature. 


Encyclopedias and Dictionaries 


A major publication event was the revised Encyclopedia of Educational 
Research (83) in greatly improved and enlarged form. Worthy of note 
are the two-volume Handbook of Applied Psychology (48) and two 
dictionaries in the area of economics (66, 101). 


Guides to Theses and Selected Research Projects 


The guides in this area included comprehensive lists (109, 110) of 
dissertations completed and compilations of dissertations, theses, degrees, 
and research projects in the areas of education in general (51, 52, 69), 
Negro education (73), health and physical education (36), child life (44), 
sociology (3, 4, 5, 6, 7, 8, 9), political science (2, 45), modern languages
(81, 82), and speech and hearing (72). 


Serial and Occasional Bibliographies and Summaries 


The monthly selected references in the School Review and Elementary 
School Journal continued as the leading serial bibliographies. Other major 
continuing bibliographies or summaries included the fields of research 
methods (54, 55, 56), history and philosophy of science (97), teacher 
supply and demand (42, 43), guidance (62, 63, 64), reading (58, 59, 60),
and modern languages (34). Occasional bibliographies or summaries of 
considerable scope appeared in the areas of higher education (75), general 
education (79), junior college (35), teacher efficiency (12, 40), elemen- 
tary curriculum (100), youth (32), mentally handicapped children (71), 
psychiatric books (80), reading (108), speech (107), communications 
research (76), music (74), and business education (41). 


Institutional and Biographical Directories or Handbooks 


The American Council on Education published a noteworthy volume, 
Universities of the World Outside the U.S.A. (31), in its series of higher 
education directories. Based on questionnaires and official publications,
the new handbook covers many details for over 2000 higher institu- 
tions located in more than 70 foreign countries. Other institutional or 
biographical handbooks included the annual report of the American 
Council on Education (115), the third edition of Leaders in Education 
(26), scientific and technical organizations (86), and university presses 
(70). 


Historiography and Principles of Historical Writing 


Brickman (19) wrote an important pioneer book devoted wholly to
historiography in education, altho there are many good treatments of the 
historical method in the field of history proper, including a number of 
titles listed in the bibliography of this chapter. In addition to the usual 
technics of historiography, Brickman has given literally hundreds of 
illustrations from educational history. His book represents a point of 
view emphasizing somewhat more the discovery and reporting of facts 
than selection of data for solution of a specific problem or in terms of a 
particular frame of reference. Perdew (89, 90) developed criteria for 
research in educational history. He reported that of 556 publications in 
the history of secondary education in America only 207 qualified as good 
historical research in terms of adequate treatment of purposes, presentation 
of facts, and generalizations; more than half of the reports were remi- 
niscence, chronology, or philosophy (89). 

Among the book-length treatments of problems of historiography and 
historical writing were general discussions of the methodology of the 
social sciences (50, 99), historical method in general (14, 57, 95, 104), 
research in American history (65), autobiography (91), documentation 
(17), five centuries of interpretation (46), analysis of the presidential 
addresses of the American Historical Association (10), theological im- 
plications (78), science in history (15), and work and history (98). The 
periodical literature included treatments of historical principles in relation 
to contemporary theory (38), postwar reorientation (87), intellectual 
history (61), social responsibility (92), military history overseas (25), 
relativism (77), the “presidential synthesis” (33), and rhetoric (113). 

Legal research on educational problems may be regarded as a special type 
of historical investigation, since adequate synthesis and interpretation of 
the principles of educational legislation, like good historiography, involve 
similar processes of collecting data, criticism, and interpretation. Among 
the recent treatments of legal problems (49, 68, 93, 94), Remmlein’s 
School Law (93) is outstanding, with a chapter organization in terms of 
problems of teacher personnel and pupil personnel, an appendix dealing 
with the legal guides, editorial comments at the beginning of each chapter
and other editorial notes, and illustrative extracts from the statutory and
case materials.

Examples of a variety of historical writing in a number of fields related
to education may be cited: psychology (16, 84), sociology (11, 88), 
philosophy (47, 112), 200 years of social science (67), a century of 
archaeology (37), and science and scientific thought (105, 106). 

The number and scope of the historical and social treatises listed in 
this section offer additional testimony that, in a time of social crisis, edu- 
cational, civic, and even governmental leaders turn to the social sciences 
for direction in dealing with the perplexing problems created by physical 
science, invention, and technology.


CHAPTER II


Trend and Survey Studies 


C. ROBERT PACE and ARTHUR D. BROWNE 


From the analysis of survey and trend studies reviewed in this chapter, 
several generalizations may be made. The significance of comprehensive, 
cooperative, and multisponsored survey projects, for example, is evident. 
Perhaps because of the increase in studies of this scope, there is also 
evident a growing concern about the procedures as well as the technics of 
surveys. By procedures is meant the consideration of who does what in 
the process of conducting the survey. While questionnaires continue to be 
the dominant device for data gathering, a variety of other technics have 
also been employed, particularly in the large scale and multisponsored 
projects. Along with this there are a number of articles which describe 
the use of these technics for those who are not professionally trained in 
research, but who are directly confronted by the problems in the field. 
Finally, and perhaps as an outgrowth of wider participation in surveys, 
the literature reflects a number of critical appraisals of current methodology 
and procedure. 


Surveys of Practices 


A survey of practices in meeting pupil-adjustment needs was reported 
by Nolan (59) from data obtained by a questionnaire to superintendents 
of schools in 110 Kansas cities. Forty-seven replies were received. Wimmer 
(88), from 447 questionnaires, summarized guidance procedures and 
practices at the secondary level. Witty and Brink (89) collected information 
about current practices in reading instruction by means of letters to 500 
school systems followed by questionnaires to schools with special remedial 
classes. Toulouse (78) described what provisions for curriculum develop- 
ment were in effect in several large cities. Boyd and Schwiering (7) 
surveyed the child guidance and remedial reading practices in clinics 
affiliated with institutions of higher learning, public schools, and inde- 
pendent organizations. The Pennsylvania branch of the National Asso- 
ciation of Secondary-School Principals (9) reported a study of school 
practices in the recruitment of teachers. Dale (14) surveyed curriculum 
problems and practices in Ohio schools. 

Michaelis (47) studied current practices in evaluation in city school 
systems. The data were collected by an analysis of handbooks, cumulative 
records, and bulletins on evaluation published by the school systems; by 
a checklist on selected aspects of evaluation; and by conferences with 
those in charge of evaluation in selected cities. 

In higher education, Arbuckle (5) described chapel services and 
religious counseling and teaching practices in 11 colleges. Nimkoff and 
Wood (58), using questionnaire data from 66 women’s colleges, reported
the proportion of women on the administrative and teaching staffs of those 
colleges and on the boards of trustees. The report also indicated the 
changes which had occurred in these percentages over a period of years. 
Obtaining his data by interviews in 46 colleges and universities, Wood- 
burne (91) described and commented on faculty personnel policies in 
higher education. 

In the above reports, while the questionnaire was the predominant 
method of inquiry, many of the studies did not indicate the proportion of 
returns from their inquiries. Where such information was given, the
percent of returns varied from approximately one-half to two-thirds of the
populations sampled. 
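
The proportion of returns discussed above is simple to compute whenever a study reports both the number of inquiries sent and the number of replies received. The following is a minimal sketch in Python and is not part of the studies reviewed; the function name `return_rate` is hypothetical, and the only figures used are those given above for Nolan (59), namely 110 cities surveyed and 47 replies.

```python
def return_rate(sent: int, returned: int) -> float:
    """Return the percent of inquiries that were returned."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= returned <= sent:
        raise ValueError("returned must be between 0 and sent")
    return 100.0 * returned / sent


# Figures reported above for Nolan (59): 110 Kansas cities, 47 replies.
nolan_rate = return_rate(sent=110, returned=47)
print(f"Nolan (59): {nolan_rate:.1f} percent returned")
```

Reporting both counts, and not only the percent, lets a reader judge how far the returns may be taken to represent the population sampled.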


Surveys of Conditions and Facilities 


Surveys of business education were made in Massachusetts by Zimmer 
(96), and in New Jersey by Losi (41). Both surveys relied upon data 
obtained by questionnaires and both reported a high percentage of re- 
turns—80 percent in the former and 95 percent in the latter. A school 
building survey of San Francisco was reported by Engelhardt (24), using 
a rating card for elementary schools which was developed for the purpose 
of this survey. A U.S. Office of Education report by Kempfer (83) 
described adult education activities in the public schools, based upon a 
69 percent return from a checklist mailed to nearly 5000 school districts. 
The facilities for vocational education in Negro high schools in Texas, 
together with a study of employment conditions among Negroes, were
reported by Bryant (8). Questionnaire returns from 68 percent of 673 
schoolboard members provided the data for Hunter’s (36) report on the 
social composition of Louisiana Parish School Boards. Trow (81) reported 
a survey of the organization, staff, clientele, types of problems accepted, 
and types of service rendered by 548 psychological service centers. 

Data collected by the Elmo Roper organization from a national sample 
of over 11,000 high-school seniors, together with intensive interviews of 
more than 5000 seniors from large city schools, provided the basic infor- 
mation for Davis’ (17) descriptive and analytic study of discrimination in 
college admissions. The study was sponsored by the American Council 
on Education. 

Among other examples of surveys of conditions, Putnam’s (65) survey 
of general courses in American colleges, Munn and Schlauch’s (52) survey 
of elementary statistics courses, and Hammond’s (31) analysis of the 
content of mathematics courses may be cited. A more comprehensive study 
was reported by Hanley (32), based upon an analysis of 266 textbooks 
which were appraised in the light of their handling of materials pertaining 
to intergroup relations. Correspondence with a large number of teachers 
and the use of many consultants from various fields of study were also part 
of the methodology of this investigation.

The Research Division of the National Education Association (55)
continued to issue its quarterly reports, covering surveys of a variety of 
conditions in the public schools. The areas surveyed during the past
three-year period included a study of the programs of local education
associations; school-housing needs; trends in city school organizations;
salaries and salary schedules; state legislation affecting school revenues;
teachers in the public schools; pupil patrols; fiscal authority of city
schoolboards; personnel and relationships in school health, physical
education, and recreation; public-school retirement policies; and teaching
loads.


Surveys of Opinions and Judgments 


The judgments of students, teachers, administrators, and parents about 
a variety of educational practices, facilities, and conditions have been the 
focus of several investigations. Wedemeyer (85) used both a signed and 
unsigned questionnaire to determine the attitudes of the student body 
toward the educational program at the Racine Extension Center, University 
of Wisconsin. Lyman (42) constructed a 90-item scale for measuring 
students’ attitudes toward school, and reported differences between the 
responses of two schools which had markedly varied in the degree of 
cooperation and order shown in the test-taking behavior. McGrath (43) 
described a questionnaire for the evaluation of student teaching experience 
by student teachers. 

Answers to the question “what changes in your school within the past 
three years have been most helpful to you for doing a better job of 
teaching?” were analyzed by Cunningham, Applegate, and Hilliard (13). 
In this study teachers’ opinions and attitudes concerning a wide range of 
problems were also investigated and interviews were held with teachers 
in each of the school systems. Teachers, administrators, and board members 
responded to a questionnaire analyzed by Burtt and Campbell (10), giving 
their views on the strengths and weaknesses of administrative leadership. 
Erickson (26) surveyed 351 school administrators to determine their 
interest in various guidance and counseling programs. The administrators 
were also asked to indicate the degree of importance they attached to 
securing outside assistance in developing the various aspects of these 
programs. Fowler and Nelson (28) questioned 305 principals of central 
schools in New York State to determine their interest in a proposed 
extension service in guidance. 

Goodykoontz (30) analyzed several surveys which have been made to 
ascertain whether parents were acquainted with, and approved of, the 
present instructional programs in the schools. Slade (71) sent question- 
naires to both school people and parents to obtain their judgment regard- 
ing the responsibility of the schools in the development of children’s 
personalities. 


Travers and Niebuhr (80) administered a questionnaire to entering 
freshmen of the several city colleges of New York to survey their vocational
choices and to estimate the number who could be considered as prospective 
students of education. An attempt was made to ascertain the degree of 
certainty of these choices. A questionnaire survey among veteran students 
enrolled at Champlain College was reported by Michal-Smith (48), in 
which the purpose was to discover the students’ objectives, how intent 
they were in the pursuit of their careers, and to estimate what proportion 
needed guidance. Stone (74) reported that the results of students’ responses 
to the Mooney Check List were presented to the faculty, so that areas of 
improvement in the college program could be better recognized and 
remedied. Both information and attitude questions were included in a 
nationwide survey of some 10,000 students reported by Shimberg (70), 
in which greater information about world affairs was associated positively 
with greater international mindedness and awareness of the implications of 
events. 

Jersild and Tasch (38) studied the likes and dislikes of over 2000 
pupils in Grades I to XII. A series of short, open-ended questions ad- 
ministered as an individual interview in the first three grades, and 
administered as a questionnaire in the other grades was the basic technic 
for obtaining the data. Swenson (77) studied students’ interests by 
analyzing the content of 680 letters written by public-school children in 
Grades IX thru XII. 

While most of the studies reported under this heading used questionnaire
technics, there were a few examples of the use of open-ended
questions and the analysis of personal documents. The use of attitude
scales and tests was also noted. 


Comprehensive Surveys 


Comprehensive surveys of statewide programs in higher education have 
been reported from New York, Illinois, Iowa, and Minnesota. In New York,
for example, following the report of the Temporary Commission on the 
Need for a State University (56) and the formal establishment of the 
State University of New York in April 1949, two comprehensive surveys 
were made (56, 57). Questionnaires, interviews, visitations, special con- 
sultants, public hearings, and a variety of census data were used in these 
surveys. 

Russell (67) reported a study undertaken in 1949 by the U.S. Office of 
Education for the Governor of Illinois on the structure for the
organization and control of the state system of tax-supported higher education.
The report presents five proposals, objectively stated, with advantages 
and disadvantages of each. Strayer and others conducted a survey of the 
three main state institutions of higher education in Iowa (15). 

A report on higher education in Minnesota (50) drew together a 
variety of research studies to describe present conditions in Minnesota. 
The economic and cultural resources of the state were assessed; student
potential for education was studied and related to existing facilities for 
junior college, liberal arts, teacher, and university education in the state; 
future actions were recommended. Considerable data for these various
inquiries were gathered from government statistics, census reports, school 
reports, from questionnaires, testing programs, follow-up studies, inter- 
views, and other technics. The report provides a good example of what 
can be accomplished by a series of specific investigations, many of them 
conducted on a high plane of experimental design and analysis. 

Price (64) summarized 12 state and national surveys pertaining to the 
need for public junior colleges, citing the common and varying recom- 
mendations which have been applied to a number of characteristics of 
junior colleges. Anderson (3) reported the initiation of the third major 
survey of medical education in the United States which has been con- 
ducted in the past 40 years. 

While the report of the President’s Commission on Higher Education 
(84) was published in 1947, it has provided an outstanding example of 
integration, analysis and recommendation on a national level. A significant 
characteristic, evident thruout the report, is the high degree of reliance 
upon statistical reports and objective data. 


Trends and Follow-up Studies 


Books discussing trends have been published in several fields related
to education. The volume edited by Williamson (87) on trends in student 
personnel work is an example. Other examples may be cited in psychology 
(18, 19), higher education (53, 54), and educational research (2). For 
the most part, these volumes consist of the collection of papers presented 
at a conference. 

Heston (34) cited appreciable mean gains on the graduate record 
examinations by seniors over their performance on the examinations as 
sophomores. Mayo and Kinzer (45) noted the racial attitudes of white 
and Negro high-school students on the same test in 1940 and 1948, finding 
attitudes more favorable to the Negro in 1948. Dudycha (22) compared 
the responses of college freshmen tested in 1949 to 25 religious proposi- 
tions with the responses of college freshmen in 1930. No marked changes 
were noted. Finch and Gillenwater (27) administered the Thorndike- 
McCall Reading Scale to sixth-grade students in 1948 and noted the 
superior performance and increased variability of their scores in com- 
parison with sixth-graders in the same school who had been tested in 1931. 
Goette and Roy (29) reported a decrement over a 10-year period in 
scholastic aptitude scores of freshmen entering Rochester Junior College. 

Spencer (72) analyzed the changes in the role of deans of women in 
colleges and universities over an 11-year period. An inquiry form was
sent to 887 colleges and universities and returned by 71 percent of them. 
The survey covered such areas as academic qualifications and rank, titles,
living arrangements, teaching responsibilities, administrative organization 
of personnel services, personnel functions of deans of women, and 
problems seen by deans of women. Mueller (51) sent a double post card 
to nearly 1000 colleges and universities, inquiring about their use of 
student ratings of faculty. Zeran and Jones (95) have described the 
results of a national survey of guidance and pupil personnel services. 

The follow-up studies cited below reveal the use of a variety of data- 
gathering methods and an emphasis upon analysis of relationships among 
various kinds of data gathered. Weinrich and Soper (86) studied the 
succeeding five-year performance records of 30,000 students who were 
enrolled in the eighth grade in 1940. The study analyzed reasons for 
dropout in relation to age, home background, scholastic record, personal 
qualities, post-school activities, and other variables. Anderson (4) in- 
vestigated the present marital and family status of graduates of the Cornell 
classes of 1919, 1920, and 1921. The data were gathered by questionnaire. 
Aldrich (1) conducted a follow-up study with a group of 1940 freshman 
girls who had received special guidance in social adjustment, comparing 
their subsequent activities and status with a group which had not received 
this special guidance. The inspection of records in the Student Counseling 
Bureau, Student Activities Bureau, Bureau of Admissions, disciplinary 
committees, Student Health Service, Alumni Association, and other campus 
offices yielded the basic data for her follow-up and comparison. McIntosh 
(44) appraised the present status of 1000 graduates of the Jarvis School 
for Boys in Toronto between 1936 and 1946. Information was gathered 
by letters, telephone calls, and personal interviews. Income, employment, 
and other factors were related to IQ groupings. Rothney and Roens (66) 
presented follow-up data, extending in some cases to 11 years after initial
contact, from a study of counseling of high-school pupils in Arlington, 
Massachusetts. Pace (61) reported on the civic and political activities and 
opinions of a national sample of college graduates from data collected 
by the research department of Time magazine.


Self-Surveys and Cooperative Studies 


Enochs (25) reported the development of a checklist of characteristics 
of a good administrator by a group of administrators working together 
toward self-evaluation. Norwood (60) described the organization of 264 
county committees in Texas to study the schools. One function of the 
county committees was to submit questionnaires and opinionnaires to the 
people. At Danville, Illinois, citizens, consultants, and public-school em- 
ployees were organized into committees to survey the school system 
(16, 49). The committees studied population trends, buildings and sites, 
the educational program, and financial resources. Dodd and O’Brien (20) 
described the development and administration of an attitude survey and 
its effects upon community planning in a housing project. A self-survey
of Syracuse University was reported by Troyer and Pace (82). Approxi- 
mately 100 faculty members were organized into nine major committees 
to conduct investigations and make recommendations regarding such 
topics as curriculum, staff and instruction, administrative organization, 
personnel services, and so forth. Detailed questionnaires, interviews, and 
examination of university records were the primary sources of data for 
the survey. 

A number of articles (33, 69) have considered the strengths and weak- 
nesses and purposes of self-surveys. The main point of these articles has 
been the argument that direct involvement in the process of conducting a 
survey by those whose problems are being investigated paves the way for 
action upon the recommendations which emerge from the survey. The 
dangers of omitting certain areas of investigation and of using simple 
technics where more sophisticated ones would be desirable have been 
mentioned as weaknesses of self-surveys. 

Cook (11) reported the results of a cooperative college study in inter- 
group relations. Articles by Holley (35), Potthoff (63) and Cooper (12) 
have described attempts of the North Central Association to work co- 
operatively with member institutions on a variety of studies and research 
projects in liberal arts colleges and teachers colleges. Edgar (23) has 
described the organization of the Great Neck Schools survey, undertaken 
by the Institute of Field Studies at Teachers College, Columbia University. 
Five study-action groups (consisting of teachers, administrators, laymen, 
and consultants) conducted interviews, developed questionnaires, examined 
published materials and data from school records, observed practices, and 
held conferences and discussions. In the final stages an evaluation question- 
naire was given all participants to determine their attitudes toward the 
project. 

A number of survey and research bulletins have come from the Illinois 
Secondary School Curriculum program (37). Many of these are pamphlets 
describing how to conduct various kinds of studies. Checklists, tally 
sheets, inventories, questionnaires, and other data-gathering methods are 
presented with instructions for their use. The need for manuals of this 
kind probably increases in some relation to the number and variety of 
individuals who participate in surveys. A somewhat similar series of 
pamphlets has been produced by the Metropolitan School Study Council 
(46) of New York. 

One of the largest of the new cooperative studies is the Cooperative 
Program in Educational Administration (68, 90), subsidized in part by 
the W. K. Kellogg Foundation. Regional centers have been established at 
Teachers College, Harvard, Texas, Peabody, Chicago, Ohio State, and 
Oregon. Each regional center, in turn, has established cooperative relation- 
ships with school administrators, board members, colleges and universities, 
and citizen groups within its service area. A variety of studies have been 
planned, and some of them carried out in the initial stages, but the results 
are not yet published in the literature. 


Critical Appraisal of Trend and Survey Studies 


The questionnaire is still the chief method for conducting surveys. 
In many of these surveys, however, the sampling problem is still treated 
rather lightly. Too frequently, questionnaires are sent to an available 
population, rather than to a random or systematically selected population. 
While the proportion of returns in some studies has been as high as 60 or 
80 percent, the authors do not typically present statistical evidence that 
those who returned their questionnaires are representative of the larger 
population. The responsible public opinion polling agencies appear to be 
more sensitive to the need for adequate sampling than most of the workers 
in educational research. Woodward (92) has described the present state 
of polling methodology. Woodward and Harris (93) discussed the appli- 
cations of opinion research to vocational guidance. Parten (62) has pro- 
duced a guidebook on the practical procedures of surveys, polls, and 
samples. Likewise the significant volume of Stouffer and associates (75) 
on Measurement and Prediction and the volume edited by Lindquist (40) 
on Educational Measurement should be mentioned as major contributions. 

In questionnaire surveys which use open-ended or free-response items 
there is frequently no description of the important processes of coding and 
classification. 

The direct use of records and personal documents has resulted in several 
significant reports, which have been previously cited (1, 38, 47, 77). 
Another notable example is the monograph of Bloom and Broder (6) on 
problem-solving processes of college students. In an effort to analyze the 
nature of mental processes, the investigators trained students to think 
aloud when solving problems, and a complete record of their reports was 
analyzed. Travers (79), in reviewing procedures for the evaluation of the 
outcomes of teaching English, has also noted that direct measurement has 
more promise than the objective, or best-answer, type of examination. He 
cited as particularly promising the devices for determining the out-of- 
school reading activities of children. Strang (76) noted too much reliance 
on subjective evaluation in studies in personnel work. The need for more 
direct methods of arriving at judgments was mentioned. 

Katz (39), in discussing survey technics in the evaluation of morale, 
expressed a judgment that technical improvements in survey methods 
were of minor importance compared to the growth of the conception of 
research design as a basic frame for the planning and execution of a 
survey. 

The requirements of cooperative projects in which more emphasis may 
be placed on the organization of human relationships within the study 
than upon the technics and methods of research to be employed are sometimes 
in apparent conflict with the requirements of a sound research 
design. Corey (2) has suggested that the proper criterion for evaluating 
so-called action research studies or surveys is the degree of action which 
results from them. Cornell (2) has discussed the problem of getting action 
by means of the school survey made by outside experts. 

In discussing follow-up studies of college graduates, Pace (2) has 
suggested that the model for constructing the questionnaire should be 
an achievement test battery. He described a questionnaire constructed 
from this point of view which consisted of a series of attitude and 
opinion scales, each of which had been subjected to a scale analysis and 
item analysis, and for which test-retest reliability had been determined. 
Such a design greatly increases the opportunity for statistical analysis of 
the results. Wrightstone (94) discussed general trends in evaluation. 
Dressel and Schmid (21) surveyed and evaluated the research studies 
on the General Educational Development tests. 

In summary, the major problems which still need greater attention in 
the conduct of survey and trend studies are the problems of sampling, of 
experimental design, of the effective use of direct observation, and of 
the refinement of questionnaires. There is also a need to clarify the 
relative values and points of overlapping between self-surveys or cooperative 
studies on the one hand, and surveys by research experts on the 
other. 


CHAPTER III 


Applications of Experimental Design and Analysis 


DEE W. NORTON and EVERET F. LINDQUIST 


This chapter is primarily concerned with two aspects of the studies 
reviewed, namely, the appropriateness of the experimental designs em- 
ployed, and the methods of statistical analysis applied to the data. The 
chief purpose of the chapter is to draw attention to some of the more 
serious or more frequently recurring errors that are currently being made 
in experimental design and analysis in educational research. Occasion is 
also taken to point out some instances of particularly competent treatment 
or the use of technics or processes new to educational research. This 
purpose does not demand a comprehensive review, and no claim for com- 
prehensiveness is made for this chapter, either in the sense that all 
reported research studies have been considered or that all aspects of 
design and analysis have been evaluated for each study mentioned. Many 
studies representing appropriate but routine applications of standard 
procedures have been omitted from the review, as have some studies pro- 
viding further illustrations of very common errors. Unfortunately, it has 
not always been possible in the space available for this chapter to attempt 
to describe each study in sufficient detail to make the critical comments 
meaningful in themselves, and the reader may often find it necessary to 
refer to the original research report to derive full meaning from these 
comments. This is in any event desirable, since it will give these criticisms 
their proper setting and will prevent unjustified generalizations concerning 
the general quality of each study. 

On the whole, the authors have been none too favorably impressed with 
the general quality of contemporary educational research so far as experi- 
mental design and analysis are concerned. The published studies give little 
evidence that the typical educational research worker has achieved a 
thoro understanding of the experimental design he has employed, or that 
he is familiar with recent developments in the field of experimental design. 
The reviewers have been particularly concerned with the many instances 
of inadequate or even unintelligible reporting of research studies. In many, 
perhaps in a majority of the studies considered, the reviewers have had to 
attempt to infer by indirect methods exactly what technics were used or 
procedures followed by the experimenter. In some instances, injustices 
may have been done to the experimenters on this account. Apparently there 
is serious need in educational research for more thoro and rigorous train- 
ing of research workers, both in the use of research technics and in report 
writing. Apparently, too, there is need for more rigorous editorial standards 
and more careful editing. 


Methods Research 


Experiments Involving the Combining of Intact Groups 


Experimental studies designed to compare the effects of different in- 
structional procedures on pupil achievement constitute an important 
segment of educational research. There has been an increasing tendency 
to carry out methods experiments in actual school situations—a practice 
which is certainly to be encouraged. However, the problems of design and 
analysis of experiments involving intact school classes have not been 
adequately considered by many researchers. 

Johnson (24) and Bentley (6) used almost identical designs in experi- 
menting with visual-aids materials. Johnson, in his study concerned with 
the teaching of two geometry units, selected 12 schools at random from a 
defined population of schools. Among other requirements for inclusion in 
this population was the requirement that each school have two or more 
classes (sections) in plane geometry. One (or more) experiment(s) was 
carried out in each school by randomly assigning an experimental method 
to one class and the control method to the other. The same experimental 
method was not employed in every one of the 12 schools, but each 
experimental method was used in at least two schools randomly selected 
from among the 12. IQ measures, initial achievement test scores, and final 
achievement test scores were obtained for each subject, the first two being 
used as control variables, the latter as the criterion. In each of the 15 
experiments, a t-test was used to determine the significance of the mean 
difference in final achievement test scores. The test assumed simple random 
sampling, apparently disregarding possible systematic differences between 
the sections. For the two or more schools using the same experimental 
method, the several experimental groups (and the several control groups) 
were combined and regarded as simple random samples for the purpose 
of applying the Johnson-Neyman technic to the means of the combined 
groups. The combining procedure was defended on the basis of non- 
significant differences among the component variance estimates as revealed 
by the Welch-Nayer L-test and nonsignificant differences among the 
component means as revealed by simple analysis of variance. The Johnson- 
Neyman technic, discussed in detail by Johnson and Fay (25), involves 
(a) adjustment of the means of the pooled groups for differences in the 
control variables, (b) application of tests of significance to the differences 
between the adjusted means, and (c) definition (in terms of the control 
variables) of a population concerning which valid inferences could pre- 
sumably be drawn from the test of significance. Analysis of co-variance 
procedures were also applied to the data of the combined groups, apparently 
as a check on the significance tests made by the Johnson-Neyman method. 

REVIEW OF EDUCATIONAL RESEARCH Vol. XXI, No. 5 

In his introductory remarks Johnson states that, “the three requirements 
for a scientifically designed experiment from which clear cut conclusions 
may be drawn are: (1) randomization, (2) replication, and (3) local 
control.” Randomization was used in his study both in the selection of 
schools and in the assignment of methods to sections, but not in the 
assignment of pupils to treatment groups. The replications (by schools) 
used in this experiment were not exploited. The process of pooling used 
assumes no interaction of methods and schools. However, in summarizing 
the results of the t-tests, Johnson pointed out that in some schools the 
experimental method proved superior to the control while in other schools 
the reverse was true. This indicates that methods by schools interactions 
were almost certainly present. Johnson’s design provides no opportunity 
to test the significance of these interactions or to make generalizations apply- 
ing to any particular population of schools. A “methods x schools” design, 
in which the interaction between methods and schools is a random effect 
(due to the random selection of schools) and hence a valid error term, 
would consequently have been more appropriate in this study. “Local 
control,” as explained by Johnson, refers to the device of comparing an 
experimental group with a control group in the same school. Since school 
differences are largely teacher differences, and since different teachers 
were used with the experimental and control methods in each school, the 
“control” achieved by this device is of doubtful importance. 
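
The error term recommended above can be illustrated by a minimal sketch in Python, offered only as an illustration. It assumes one class mean per method in each school; the function name and the four pairs of class means are hypothetical and are not taken from Johnson's report. With two methods, the F ratio formed against the methods-by-schools interaction is the square of the t obtained from the school-by-school differences.

```python
def methods_by_schools_f(exp_means, ctl_means):
    """F ratio for methods, tested against the methods-by-schools interaction.

    Each list holds one class mean per school, in the same school order.
    Returns the F ratio and its degrees of freedom (1, n_schools - 1).
    """
    n = len(exp_means)
    if n != len(ctl_means) or n < 2:
        raise ValueError("need the same schools (at least two) under both methods")
    diffs = [e - c for e, c in zip(exp_means, ctl_means)]
    mean_diff = sum(diffs) / n
    # Methods mean square (1 df), built from the average difference.
    ms_methods = n * mean_diff ** 2 / 2
    # Interaction mean square: how much the difference varies from school to school.
    ss_interaction = sum((d - mean_diff) ** 2 for d in diffs) / 2
    ms_interaction = ss_interaction / (n - 1)
    return ms_methods / ms_interaction, (1, n - 1)


# Hypothetical class means for four schools (illustrative values only).
experimental = [52.0, 48.0, 55.0, 47.0]
control = [50.0, 49.0, 51.0, 48.0]
f_ratio, df = methods_by_schools_f(experimental, control)
print(f_ratio, df)
```

Because the schools are treated as a random sample, the interaction mean square in the denominator carries the school-to-school variation in the method effect, which a pooled within-classes term would leave out.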

The practice of combining groups on the basis of nonsignificant out- 
comes of tests of significance is quite common in recent educational 
research. Failure to find significant differences between the means (or 
variances) of two groups does not prove that no difference exists. 
Statistical nonsignificance, depending as it does on the precision of the 
experiment, cannot reasonably be used as the sole criterion for pooling. 
The variability of a group which is composed of several school classes 
includes the variability of the class means. In an experiment involving 
several groups, each composed of a number of intact classes, the within- 
group mean squares for the various groups cannot be regarded as 
estimates of a common population variance. In such experiments, the use 
of the within-groups mean square as an error term is sometimes defended 
on the grounds that it is an “overconservative” estimate of error. This 
interpretation fails to recognize that Type II errors (the risk of which is 
increased by the “conservative” test) are often as serious as Type I errors. 
It also fails to recognize that the population to which the valid inferences 
may be drawn from the test of significance is, in general, a population in 
which the same teachers and class conditions are involved in exactly the 
same way as they are in the experimental situation. Such populations are 
obviously of extremely limited general interest. If interactions between 
methods and classes (teachers) are present, as was apparently the case in 
the experiments of Johnson and Bentley previously cited, the referent 
population is still further restricted. 
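
The statement that the variability of a group composed of several classes includes the variability of the class means can be shown by a simple partition of the sum of squares. The following Python sketch is an illustration under stated assumptions: the function name and the scores for the two classes are hypothetical.

```python
def split_group_variation(classes):
    """Partition a treatment group's sum of squares.

    `classes` is a list of score lists, one per intact class.
    Returns (total, between_classes, within_classes) sums of squares.
    """
    scores = [x for cls in classes for x in cls]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_between = 0.0
    ss_within = 0.0
    for cls in classes:
        class_mean = sum(cls) / len(cls)
        # Variation of the class mean about the group mean.
        ss_between += len(cls) * (class_mean - grand_mean) ** 2
        # Variation of pupils about their own class mean.
        ss_within += sum((x - class_mean) ** 2 for x in cls)
    return ss_total, ss_between, ss_within


# Hypothetical scores for two intact classes under one treatment.
group = [[10, 12, 14], [20, 22, 24]]
total, between, within = split_group_variation(group)
print(total, between, within)
```

When the classes are pooled and treated as one simple random sample, the between-classes part is absorbed into the "within-group" mean square, which is why that term cannot be regarded as an estimate of a common population variance.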

The foregoing discussion should not be construed as a blanket indict- 
ment of the practice of combining intact groups. Ash (3), in an experiment 
comparing massed and spaced film presentations, assigned intact classes 
at random to the various treatments. In the subsequent analysis he 
regarded the treatment groups as simple random samples after showing 
that the differences among the means of the classes under any given treat- 
ment were not significant. Since the criterion used by Ash was the score 
on a test over the film content rather than some more general achievement 
measure which might reflect teacher differences, it may have been possible 
to argue on a priori grounds that differences among classes or interactions 
between treatments and classes were likely to be of no consequence. Hence 
the combining procedures used in this instance were probably justified. 


Experiments Involving Matching 


Koenker (27) carried out an experiment comparing two methods of 
preparing kindergarten children for work in arithmetic. The same two 
schools were represented in both the experimental and control groups but 
school differences were ignored on the basis of nonsignificant differences 
among school means and among school variances on both initial arith- 
metic achievement and IQ. This procedure seems particularly questionable 
since the criterion measure was gain in arithmetic achievement over a 
period of a school year. Interaction between the methods and the schools 
(teachers) on this criterion seems very likely. At the close of the experi- 
mental period, Koenker matched 27 experimental with 27 control pupils 
on the basis of initial arithmetic achievement and IQ and used the method 
involving the differences between matched pairs in applying a t-test to the 
difference in mean gain in arithmetic achievement. This method takes 
account of the correlation between the two groups. Koenker’s application 
of a test of homogeneity of variance to the distributions of gains for the 
experimental and control groups was unnecessary since the method of 
analysis used involved only the assumption that the set of obtained 
differences is a random sample from a normally distributed population 
of such differences. 
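
The method of differences between matched pairs may be sketched as follows in Python. This is a minimal illustration, not Koenker's computation: the function name and the gains for four matched pairs are hypothetical. The test works only with the pair differences, which is why no assumption about the equality of the two group variances enters.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(exp_gains, ctl_gains):
    """t ratio for the mean difference between matched pairs.

    The two lists must be in matched-pair order.
    Returns the t ratio and its degrees of freedom (n_pairs - 1).
    """
    if len(exp_gains) != len(ctl_gains) or len(exp_gains) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [e - c for e, c in zip(exp_gains, ctl_gains)]
    n = len(diffs)
    # Standard error of the mean difference, from the differences alone.
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical gains for four matched pairs (illustrative values only).
t_ratio, t_df = paired_t([5, 7, 6, 9], [4, 5, 6, 6])
print(t_ratio, t_df)
```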

Jones (26) carried out an experiment designed to compare a “standard” 
method of teaching certain subjects in Grade IV with an experimental 
method involving special attention to individual differences. The experi- 
mental group was composed of all the pupils in five selected classrooms 
in five different buildings in a particular city. The pupils in the control 
group were selected from all Grade IV pupils not exposed to the experi- 
mental treatment and represented 26 classrooms in 10 different buildings. 
The exact procedure used in matching the control and experimental 
groups was not clearly explained in the published report but apparently 
involved the pairing of subgroups on the basis of IQ, chronological age, 
and initial achievement. The median values of the matching variables for 
the experimental and control groups were approximately equal. These 
groups were subdivided into three levels on the basis of IQ measures, 
but the subgroup means on initial achievement were not equal. Indeed, 
the subgroup making the greatest gain was invariably the one with the 
lower initial mean. The criterion measure was net change in grade 
equivalent score on an achievement test. Differences between mean gains 
for the experimental and control groups (or subgroups) were tested by the 
t-technic, the standard errors involved in these tests being calculated by 
the direct method of pairing cases. No rationale was provided to justify 
the use of this procedure in spite of the fact that pupil-by-pupil matching 
was not employed. The set of differences obtained by pairing the experi- 
mental and control pupils in any given category of this study cannot be 
regarded as a simple random sample from any defined or definable popu- 
lation of such differences since the variability of any such set of differences 
is affected to some unknown extent by systematic differences between 
pupils from different classrooms. Furthermore, in the over-all comparison, 
the variability of the set of differences is inflated by possible interactions 
between the methods and ability levels. A design which recognizes the 
school class as the unit of sampling and which, therefore, gives due 
representation to class and teacher differences in the error term should 
have been employed. Such a design would also have permitted tests of 
significance to be applied to the interaction between methods and IQ levels 
and between methods and classes. 

Lueck (31) also used a pupil-by-pupil matching procedure in a methods 
experiment involving four high-school algebra classes. The experimental 
group was formed by matching, on the basis of initial achievement 
measures, pupils from the two classes taught by a special method with 
pupils from the two classes taught by a standard method. The criterion 
measure was final achievement. Lueck made appropriate use of a method 
suggested by Wilks (50) for calculating the standard error of the mean 
difference between criterion scores for groups which have been matched 
on a pupil-by-pupil basis. This method takes account of the fact that the 
matched groups are not simple random samples. However, Lueck’s design 
ignored possible class differences or class-by-methods interaction. Lueck 
also divided his groups into two ability levels and made separate criterion 
comparisons at each level. He found evidence of an interaction between 
the methods and the levels but did not test this effect for significance. 


Other Methods Experiments 


In a study concerned with the relative effects of three methods of train- 
ing on various reading and eye-movement measures, Glock (19) used 
both simple and factorial analysis of variance technics. Six remedial 
reading sections were randomly assigned to the methods, two to each of 
three methods. The t-test for related measures was used to test the null 
hypothesis regarding the mean gain within each section on each of several 
criteria. The published report contained reference to tests of significance 
of “over-all gain” for certain criteria. The meaning of this expression is 
not clear, but apparently the separate probabilities resulting from the 
t-tests were combined by the chi-square (χ²) procedure. This is justified only if the 
probabilities in question arise from independent experiments with the 
same factor, all experiments testing the same hypothesis. This condition 
does not seem to be satisfied in this study. The data involving two of the 
methods and two of the teachers were further treated factorially. In each 
of these analyses the interaction effects were tested against the within-cells 
mean square. On those criteria for which this interaction was nonsignifi- 
cant, the within-cells term was used to evaluate the main effects. This test 
takes into account only the random variations among individuals within 
the teacher-methods groups and limits the generalizations to the particular 
teachers involved. However, on criteria for which the interaction between 
methods and teachers proved to be significantly larger than the within- 
cells term, the interaction mean square was used as the error term for 
testing the methods and teacher effects. It may be observed that the 
interaction effect becomes an appropriate error term only if the two 
teachers involved in the experiment are assumed to be a random sample 
from a particular population of teachers to which it is desired to extend 
the generalizations. Glock apparently did not recognize this point, for 
had he been interested in thus extending his generalizations, this inter- 
action, whether significant or not, should have been used consistently as 
the error term. Freeburne (17), in a reading experiment employing a 
series of 2 x 3 factorial designs, also followed this ambiguous practice 
of using the interaction between methods and teachers as an error term 
only when it was significantly larger than the within-cells mean square. 
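
The procedure for combining probabilities mentioned above is presumably the one usually attributed to Fisher, in which minus twice the sum of the natural logarithms of k independent probabilities is referred to chi-square with 2k degrees of freedom. That identification is an assumption, since Glock's report does not make it explicit. A minimal Python sketch, with a hypothetical function name and hypothetical probabilities:

```python
from math import exp, log


def combine_probabilities(p_values):
    """Combine independent probabilities by the chi-square procedure.

    Returns (chi_square, degrees_of_freedom, combined_probability).
    Valid only when the probabilities come from independent tests
    of the same hypothesis.
    """
    k = len(p_values)
    if k == 0 or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("probabilities must lie in (0, 1]")
    chi_square = -2.0 * sum(log(p) for p in p_values)
    # Upper-tail area of chi-square with an even number (2k) of degrees
    # of freedom has a closed form: exp(-x/2) * sum((x/2)**i / i!), i < k.
    half = chi_square / 2.0
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return chi_square, 2 * k, exp(-half) * series


# Two hypothetical probabilities of .10 each (illustrative values only).
chi_square, chi_df, combined = combine_probabilities([0.10, 0.10])
print(chi_square, chi_df, combined)
```

The arithmetic is indifferent to where the probabilities come from. Gains measured on several criteria for the same pupils are not independent, which is the ground for the objection raised above.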

Heidgerken (21) used what she termed a “partially confounded factorial 
design” in comparing the effectiveness of various visual aids in a nursing 
arts unit. This study is cited simply as an example of the state of con- 
fusion existing among certain educational research workers with regard 
to experimental design and analysis. It is so inadequately reported that 
interpretation or critical analysis by a reviewer is quite impossible. This is 
particularly unfortunate in view of Heidgerken’s claims regarding the 
uniqueness and efficiency of her design. No table was provided to enable 
the reader to follow the analysis of variance employed, altho there is 
fairly general agreement at present on the use and format of such tables. 
Considerable space was unnecessarily given to computational formulas. 
The following statement is indicative of the confused state of this study: 
“As in all experimental work, this [stating hypotheses in the null form] 
means that one single crucial experiment with adverse results could 
disprove a million favorable experiments.” (21: 263) 

Mitchell (34) used three “unselected groups” of sixth-grade pupils in 
an experiment concerned with the effects of radio listening on silent read- 
ing achievement. Each group took a reading test under each of three 
experimental conditions. A different form of the criterion test was used 
with each treatment and the order of administration of the treatments was 
different for each group. Simple analysis of variance procedures were 
employed to compare the criterion means for groups of pupils classified 
with respect to type of program (treatment), sex, and IQ level separately. 
In this experiment the possibilities of systematic group differences or of 
interaction between the treatments and the intact groups seem unlikely, 
so that the assumption of simple random sampling is probably justified. 
It is to be observed, however, that since the same test form was always 
associated with the same treatment these two effects are completely con- 
founded. Mitchell might have avoided this confounding, and at the same 
time controlled order effects, by using a 3 x 3 Latin square design. It is 
in such situations, granting that all interaction effects may be assumed 
to be nonexistent in the referent population, that the Latin square design 
is of greatest usefulness (14). 
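
One possible 3 x 3 Latin square layout for an experiment of this kind is sketched below in Python. The arrangement is an assumption made for illustration and is not Mitchell's design: rows stand for the three pupil groups, columns for the three testing occasions (each with its own test form), and the cell entries for the treatments. The function names and treatment labels are hypothetical.

```python
def cyclic_latin_square(symbols):
    """Build a Latin square by cyclically shifting the list of symbols."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin(square):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


# Rows: pupil groups; columns: testing occasions (one test form each);
# entries: treatments. Labels are hypothetical.
layout = cyclic_latin_square(["A", "B", "C"])
for group_number, row in enumerate(layout, start=1):
    print("Group", group_number, row)
```

In such a layout each treatment meets each test form and each position in the order exactly once, so treatment comparisons are free of both effects, provided the interactions may be assumed negligible. In practice the rows, columns, and treatment labels would be assigned at random.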

Fitch, Drucker, and Norton (16) carried out an experiment designed to 
investigate the effects of weekly quizzing and attendance at optional dis- 
cussion sessions on student achievement in a government course. At the 
end of the semester, the experimental (quiz) group and the control 
(no quiz) group were cross-classified into four levels on the basis of the 
number of discussion groups attended during the semester. The criterion 
was course achievement, and analysis of co-variance was used to control 
differences in achievement during the preceding semester. Since the process 
of cross-classification resulted in extreme disproportionality among the 
cell frequencies, these investigators chose to regard their obtained cell 
frequencies as characteristic of those in the population to which they 
intended to generalize. This assumption is tantamount to maintaining that 
there are no sampling errors in the distribution of the cell frequencies of 
the experimental and control groups with respect to the categories of the 
other factor. Since this assumption is not defensible, the subsequent 
factorial analysis employed does not seem to be justified. The single- 
classification analyses with respect to the marginal factors are not affected 
by the disproportionality. It is also to be observed that even tho the 
experimental group was over twice as large as the control group, there 
is no apparent necessity for reducing the size of the experimental group 
as was done in this instance. This study was exceptionally well reported. 
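
The notion of disproportionality among cell frequencies can be made concrete by comparing the obtained frequencies with those that would be exactly proportional to the marginal totals. The Python sketch below is an illustration only; the function names and the cell counts are hypothetical and are not the frequencies reported by Fitch, Drucker, and Norton.

```python
def proportional_frequencies(table):
    """Cell frequencies that would be proportional to the marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def is_proportional(table, tolerance=1e-9):
    """True if every obtained frequency matches its proportional value."""
    expected = proportional_frequencies(table)
    return all(
        abs(obtained - wanted) <= tolerance
        for obs_row, exp_row in zip(table, expected)
        for obtained, wanted in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows are quiz and no-quiz groups,
# columns are two attendance levels.
balanced = [[10, 20], [20, 40]]
unbalanced = [[10, 20], [30, 10]]
print(is_proportional(balanced), is_proportional(unbalanced))
```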





Observational Research 





The studies reviewed in this section are characterized by the fact that 
the crucial comparisons are made between groups which have been 
identified by the investigator on the basis of existent data. This is in 
contrast to the studies cited in the preceding section which involved 
analysis of results produced in subjects during the course of an experiment 
involving groups which had been formed by the experimenter for the 
express purpose of administering different experimental treatments. 

One of the most common problems in research of this “investigatory” 
character is that of obtaining sufficient control of concomitant factors in 
comparisons involving any single factor. Groups which have been identified 
on the basis of differences with reference to a given characteristic will 
also differ to some extent with respect to any and all other characteristics 
which are related to the given one. The finding of significant differences 
among the mean criterion scores of the aforementioned groups does not 
necessarily imply that a causal relationship exists between the factor used 
to identify the groups and the criterion measure. Clearly, the possibility 
that the observed differences are due to some unknown extent to factors 
related to the given factor makes it extremely difficult to draw unambiguous 
conclusions from the tests of significance employed. 


Interrelated Classifications 


Anderson (1) carried out an elaborate investigation in 56 randomly 
selected schools in an attempt to identify the instructional factors which 
might account for differences in pupil achievement in high-school science. 
He formed groups of classes on the basis of 13 factors. For example, 
he compared classes taught by teachers with master’s degrees with classes 
taught by teachers without this degree; he compared classes which used 
laboratory manuals with classes which did not use this device; and he 
compared classes taught by teachers from universities with classes taught 
by teachers from teachers colleges and from private colleges. The school 
classes within each group were pooled on the basis of nonsignificant 
differences among their means and variances, and the differences among 
the group means were then tested for significance. The criterion measure 
was course achievement, and intelligence and initial achievement effects 
were controlled by the method of analysis of co-variance. It is clear that 
most of the classificatory factors used in the investigation are substantially 
related. For example, teachers with master’s degrees tend also to be those 
who attended universities; teachers with master’s degrees tend also to be 
older and more experienced than teachers without this degree. Criterion 
differences between the degree and nondegree groups may thus be due to 
age or experience rather than to the degree. The comparisons made on the 
basis of each of the 13 factors investigated by Anderson involved a 
different subset of school classes drawn from the original 56. For these 
and other reasons, the fact that the observed difference reached a higher 
level of significance for one classificatory factor than for another does not 
mean that the first factor is more “important” to pupil achievement than 
the second. Altho Anderson did not explicitly draw such conclusions, his 
method of tabulating the results seems to imply that this is the case. 

Neidt and Fritz (36) classified college students by sex, age, religious 
preference, political preference, marital status, educational level, and 
father’s occupation. They attempted to determine the relation of these 
characteristics to cynicism as measured by a particular test. They formed 
a series of factorial designs, each involving sex and one of the other 
characteristics as the factors, and tested the main effects and the inter- 
actions by the method of analysis of variance. These investigators “controlled” 
sex in such comparisons by adjusting for nonproportional 
frequencies by a method suggested by Snedecor but not clearly identified 
in the report. In this study, as in Anderson’s, it is obvious that most of 
the factors are interrelated and the fact that groups classified on one 
factor show significant differences does not necessarily imply that the 
factor in question is intrinsically related to the criterion. Furthermore, 
the validity of the method of “controlling” the sex factor is by no means 
clearly established. 


Control by the Use of Factorial Designs 


The use of factorial designs to control several factors in the comparisons 
involving still another factor offers some distinct possibilities. This pro- 
cedure can be quite fruitful if most of the interactions may be presumed 
to be negligible. If interactions exist, the population concerning which 
valid inferences may be drawn is one that is stratified with respect to the 
various factors in the same manner as is the sample; that is, if a certain 
factor interacts appreciably with several others, the population for which 
the main effect of this factor is tested for significance may be so unlike 
any real and meaningful population that the results are of little value. 
This point may be illuminated by considering the following study. 

Remmers and Drucker (39) used a six-factor design involving the 
following classifications: high-school grade, geographic region, religion, 
sex, size of community, and home environment. The investigation was 
concerned with pupil attitudes as revealed by an opinion questionnaire. 
Twenty-five pupils were randomly selected from the population of pupils 
corresponding to each of the 64 cells in the multiple-classification table. 
All of the interaction mean squares except one, sex by grade, proved to 
be nonsignificant, when tested against the within-cells mean square. 
Assuming the nonsignificant interactions to be negligible, Remmers and 
Drucker pooled them with the within-cells term to form the error term 
for testing the main effects. (In view of the precision of the experiment, 
this pooling procedure seems to be justified in this case.) Under the 
conditions which obtained in this experiment, the referent population is 
one in which the sexes and the grades are equally represented. The repre- 
sentation with respect to the other four (noninteracting) factors is of no 
consequence. Such a population is readily conceivable, altho it was not 
clearly described in the report. 
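
The dependence of a main effect on the stratification of the referent population, when an interaction is present, may be illustrated by a short numerical sketch. The cell means and the grade weights below are hypothetical and are not taken from the Remmers and Drucker report; the function name `sex_effect` is likewise introduced only for this illustration.

```python
# Hypothetical cell means for a sex-by-grade table (not data from the study).
cell_means = {
    ("boys", "grade 9"): 50.0,
    ("boys", "grade 12"): 60.0,
    ("girls", "grade 9"): 56.0,
    ("girls", "grade 12"): 58.0,
}


def sex_effect(grade_weights):
    """Mean for girls minus mean for boys, with grades weighted as given."""
    def weighted_mean(sex):
        return sum(w * cell_means[(sex, g)] for g, w in grade_weights.items())
    return weighted_mean("girls") - weighted_mean("boys")


# Referent population in which the grades are equally represented.
equal = sex_effect({"grade 9": 0.5, "grade 12": 0.5})

# A population made up mostly of twelfth-grade pupils.
mostly_seniors = sex_effect({"grade 9": 0.2, "grade 12": 0.8})

print(equal, mostly_seniors)
```

With these made-up means the difference favors the girls when the grades are equally weighted and favors the boys when the upper grade predominates, so that a "main effect of sex" has meaning only for a stated weighting of the grades.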

An additional point of some general interest may be raised in con- 
nection with this study. Remmers and Drucker used the significant sex-by- 
grade interaction as an error term for testing the main effect of sex and 
of grade. The use of a significant interaction as an error term is justified 
in general only when the interaction effects may be regarded as random 
effects. This situation exists only when the categories in one (or both) 
of the marginal classifications may be regarded as a random sample from 
a population of such categories. Thus in a methods-by-schools design the 
interaction term is a random effect, and hence a legitimate error term, if 
the schools involved are a random sample from a defined population of 
schools. Such an error term involves not only chance assignment of 
subjects to methods but also chance selection of schools. Thus it is more 
“comprehensive” than the error terms involving only differences among 
individuals. There is a very common tendency in recent educational 
research to characterize as “significant” a difference which cannot reason- 
ably be attributed to “chance,” without explaining what chance variations 
are involved. 
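
The difference between the two kinds of error term may be made concrete with a minimal sketch of a balanced methods-by-schools design. The scores are hypothetical, and the variable names are introduced here for illustration only. The sketch computes the methods mean square and divides it first by the within-cells mean square and then by the interaction mean square.

```python
from statistics import mean

# Hypothetical scores: scores[method][school] lists the pupils in one cell.
scores = {
    "A": {"s1": [12, 14, 13], "s2": [15, 17, 16], "s3": [10, 11, 12]},
    "B": {"s1": [14, 15, 16], "s2": [15, 16, 14], "s3": [13, 15, 14]},
}
methods = list(scores)
schools = list(scores[methods[0]])
n = 3  # pupils per cell; a balanced design is assumed

cell = {(m, s): mean(scores[m][s]) for m in methods for s in schools}
m_mean = {m: mean(cell[(m, s)] for s in schools) for m in methods}
s_mean = {s: mean(cell[(m, s)] for m in methods) for s in schools}
grand = mean(cell.values())

# Sums of squares for methods, for the interaction, and within cells.
ss_methods = n * len(schools) * sum((m_mean[m] - grand) ** 2 for m in methods)
ss_inter = n * sum(
    (cell[(m, s)] - m_mean[m] - s_mean[s] + grand) ** 2
    for m in methods for s in schools
)
ss_within = sum(
    (x - cell[(m, s)]) ** 2
    for m in methods for s in schools for x in scores[m][s]
)

ms_methods = ss_methods / (len(methods) - 1)
ms_inter = ss_inter / ((len(methods) - 1) * (len(schools) - 1))
ms_within = ss_within / (len(methods) * len(schools) * (n - 1))

# "Chance" here means only the sampling of pupils within these schools.
f_pupils_only = ms_methods / ms_within
# "Chance" here also includes the sampling of schools.
f_pupils_and_schools = ms_methods / ms_inter

print(f_pupils_only, f_pupils_and_schools)
```

In these made-up data the same methods difference gives a much smaller ratio against the interaction term, which is the sense in which that term is the more comprehensive; a report of significance should therefore state which of the two was used.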


Control by Matching 


Studies by Robinson (42) and McGinnis (33) illustrate the use of 
matching procedures for obtaining control in observational studies. 
Robinson compared the academic performance of participants and non- 
participants in a remedial reading program. The two groups had approxi- 
mately equal means on initial measures of reading ability, college 
aptitude and intelligence. At the close of the experimental period, Robinson 
applied the t-test for unrelated measures to the difference between the 
mean grade-point averages of the two groups. He found a difference 
which was significant at about the 15 percent level and in favor of the 
participant group. He then carried out a pupil-by-pupil matching procedure 
on the basis of deciles in the distribution of the control variables, and 
applied the t-test for related measures to the mean criterion difference. In 
this instance the difference was significant at about the 10 percent level, 
again in favor of the participant group. On the basis of these two results, 
Robinson concluded that there is reason to believe that participation in 
the remedial reading program did have a positive effect on academic 
achievement. 

On the whole, the procedure used by Robinson seems to be justified, 
altho there are some difficulties involved in comparing the results of the 
two tests due to differences in sampling errors and to differences in the 
referent populations. The use of the t-test for unrelated measures in comparing 
groups which are more alike than simple random samples would be, 
while overconservative, may perhaps be defended on grounds of compu- 
tational convenience. Certainly, had Robinson found a significant differ- 
ence between criterion means on the basis of the original matching, the 
laborious pupil-by-pupil matching would have been unnecessary. Studies 
by McGinnis (33) and Shimberg (45) also illustrate the use of the t-test 
for unrelated measures in comparing groups that have been matched only 
on the basis of means or medians. While it is obvious that this procedure 
may lead to Type II errors, other more appropriate technics have not been 
adequately investigated. Lindquist (30) has made some suggestions in 
this regard but further research is vitally needed. 
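
The contrast between the two t-tests may be sketched as follows. The grade-point averages are hypothetical and are not Robinson's data; the function names `t_unrelated` and `t_related` are introduced only for this illustration. The first function uses the pooled-variance formula for independent groups, and the second uses the pair-by-pair differences.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical grade-point averages for six matched pairs of pupils.
participants = [2.9, 2.4, 3.3, 2.0, 2.7, 3.1]
nonparticipants = [2.7, 2.3, 3.0, 1.9, 2.6, 2.8]


def t_unrelated(a, b):
    """Pooled-variance t for two groups treated as independent samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def t_related(a, b):
    """t based on the mean and variance of the pair-by-pair differences."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / sqrt(variance(d) / len(d))


print(t_unrelated(participants, nonparticipants))
print(t_related(participants, nonparticipants))
```

For closely matched pairs such as these, the t for related measures is considerably larger than the t for unrelated measures, which is why the latter is described as overconservative when applied to matched groups. The two ratios are also referred to different numbers of degrees of freedom, a point the sketch does not pursue.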

The study by Anderson previously cited, and a study by Downie, 
Troyer, and Pace (12) provide illustrations of the use of analysis of 
co-variance to control concomitant variables in observational research. 
The latter study involved a questionable choice of error terms since groups 
composed of several intact curriculum groups were treated as simple 
random samples. 


Research Involving Identification of Factors 


Beckley (5) compared the gains of eight groups on several criteria 
related to achievement in retail selling. The groups were formed by cross- 
classification on three variables: experience, retail training, and years of 
college. In the first part of his study, Beckley disregarded the college 
classification and designated the group with neither experience nor train- 
ing as a control group. He obtained the correlation between IQ and retail 
achievement score for this group and determined the regression equation 
of achievement scores on IQ. He then predicted a criterion score for each 
subject in each of the experimental groups (experience and no training, 
experience and training, no experience and training) and applied t-tests 
to the differences between predicted and obtained mean achievement 
scores for each group. His claim that this procedure, due to Peters, was 
superior to the usual co-variance technic seems to be justified in this 
situation. Anderson (2) also made use of this regression procedure. 
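
The regression procedure described above may be sketched in a few lines. The IQs and achievement scores are hypothetical, and the helper `fit_line` is introduced only for this illustration. The sketch fits the line on the control group alone, predicts a score for each member of one experimental group, and averages the excess of obtained over predicted scores; the subsequent test of significance is omitted.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical control group (neither experience nor training).
control_iq = [90, 100, 110, 120]
control_score = [40, 45, 50, 55]
slope, intercept = fit_line(control_iq, control_score)

# Hypothetical experimental group (for example, training but no experience).
trained_iq = [95, 105, 115]
trained_score = [48, 53, 57]

predicted = [intercept + slope * iq for iq in trained_iq]
excess = [obs - pred for obs, pred in zip(trained_score, predicted)]
mean_excess = mean(excess)

print(slope, intercept, mean_excess)
```

The mean excess is the quantity to which a t-test would then be applied. A complete treatment would also allow for the sampling error of the regression line itself, which this sketch ignores.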

In the second part of the study, Beckley attempts to use the unbiased 
correlation ratio (ε²) to describe the “strength of relationship” between 
each of the factors and the achievement test scores. Any index of correlation, 
such as ε or r, is meaningful as a measure of the strength of the 
relationship between two variables only if based on a sample representa- 
tive of the complete distributions of both variables for a specified 
population. Quite obviously, if one wanted to determine the “strength of 
relationship” between height and weight for the population of adult 
American males, he would not base his computation of r or ε on a sample 
consisting in part of individuals below 5 feet in height and in part of 
individuals between 5 feet 9 inches and 6 feet in height, with no other 
heights represented. Yet this is exactly analogous to what Beckley does 
when he computes a measure of strength of relationship between his 
“college” variable and the test scores for a sample containing only 
individuals with either two or four years of college attendance. One 
wonders what the strength of relationship might have been had the two 
categories employed been “no college” and “four years of college” or 
“five months of college” and “six months of college.” Beckley’s use of 
ε² with the training and experience factors is not subject to criticism on 
these grounds, since the “training” and “no training” categories pre- 
sumably do account for the complete distribution of training; but he fails 
to specify very clearly for just what real and meaningful population his 
sample provides a representative distribution of training. Furthermore, 
the use of ε² assumes a normal distribution of training for the population. 
It seems likely that this assumption is far from satisfied. Incidentally, the 
Beckley study provides a good illustration of the weakness of Peters and 
Van Voorhis’ claims (37: 353-57) for the advantages of the ε² technic over 
the methods of analysis of variance. 
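
The height-and-weight argument may be illustrated by a small simulation. Everything in it is an assumption made for the illustration: the normal distribution of heights, the made-up linear relation of weight to height, the seed, and the particular height bands, which are chosen wider than those named in the text so that each contains enough simulated cases. The sketch computes r for the complete sample, for a sample drawn from two widely separated bands, and for a sample drawn from two adjacent bands.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
heights = [rng.gauss(68, 3) for _ in range(5000)]       # inches, simulated
weights = [4 * h + rng.gauss(0, 15) for h in heights]   # made-up relation


def r_where(keep):
    """r computed only on the cases whose height satisfies keep(height)."""
    pairs = [(h, w) for h, w in zip(heights, weights) if keep(h)]
    return pearson_r([h for h, _ in pairs], [w for _, w in pairs])


r_full = r_where(lambda h: True)
r_separated_bands = r_where(lambda h: h < 64 or 71 <= h < 73)
r_adjacent_bands = r_where(lambda h: 67 <= h < 69)

print(r_full, r_separated_bands, r_adjacent_bands)
```

Although the underlying relation is the same in all three cases, the coefficient from the separated bands exceeds that for the complete sample, and the coefficient from the adjacent bands falls well below it. The index thus reflects the choice of categories as much as the relationship itself.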

Tate (47) used a wide variety of analysis technics in a well executed 
and excellently reported investigation of individual differences in speed of 
response to mental test items. Four groups of items were prepared, each 
involving a different “mental function,” e.g., arithmetic reasoning, number 
series, etc. All of the items were administered to a group of 36 pupils and 
a record obtained of the time spent by each pupil on each item. Tate 
found that the distributions of time scores for each subject were extremely 
heterogeneous in variability and were consistently positively skewed. 
He applied a logarithmic transformation to the raw time scores and 
showed that the transformed score distributions satisfied the assumptions 
of normality and homogeneity within acceptable limits. Separate time 
scores were obtained for each subject on such items for both correct and 
incorrect responses. The items were also divided into three levels of 
difficulty. For the items in each of the four groups, a three-factor analysis 
of variance involving subjects, levels of difficulty, and accuracies was 
performed. In such an analysis, the interaction of a given factor with 
subjects is a valid error term for testing the main effect of that factor, 
since the subjects are a random sample from a defined population of 
subjects. Tate, however, pooled all of the interactions on the basis of 
nonsignificant differences among the component mean squares. This 
procedure involves the assumption that all interactions in the population 
are in fact negligible, an assumption that may be untenable on a priori 
grounds even tho statistical tests fail to reveal significant differences— 
particularly if the power of the tests is low. Tate also obtained a measure 
of the reliability of his speed measure when freed from accuracy effects 
by using some ingenious regression technics to determine estimated true 
speed scores for each subject. The details of this procedure, and of his 
subsequent correlation analysis, cannot be elaborated here but educational 
researchers would profit from a perusal of this investigation. 
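
The effect of a logarithmic transformation on time scores of the kind described may be sketched as follows. The times are hypothetical and are constructed to be symmetric on the logarithmic scale; the second subject is simply assumed to be three times as slow as the first. The names `skewness`, `fast_subject`, and `slow_subject` are introduced only for this illustration.

```python
from math import exp, log
from statistics import mean, pstdev


def skewness(xs):
    """Third standardized moment; zero for a symmetric distribution."""
    m, s = mean(xs), pstdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)


# Hypothetical item times (seconds), symmetric on the log scale.
log_scale = [2.0, 2.5, 3.0, 3.0, 3.5, 4.0]
fast_subject = [exp(v) for v in log_scale]
slow_subject = [3.0 * t for t in fast_subject]  # assumed three times slower

log_fast = [log(t) for t in fast_subject]
log_slow = [log(t) for t in slow_subject]

raw_skew = skewness(fast_subject)
log_skew = skewness(log_fast)
raw_sd_ratio = pstdev(slow_subject) / pstdev(fast_subject)
log_sd_ratio = pstdev(log_slow) / pstdev(log_fast)

print(raw_skew, log_skew, raw_sd_ratio, log_sd_ratio)
```

The raw times are positively skewed and the slower subject is three times as variable, whereas the transformed times are symmetric and equally variable for the two subjects. Real time scores will satisfy these conditions only approximately, which is why Tate checked the transformed distributions.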

The use of transformations of various types, as suggested in Tate’s 
study, should perhaps be considered more frequently in some types of 
educational research. An experiment by Walter and Marzolf (48) involved 
data which apparently violated both of the fundamental assumptions of 
normality and homogeneity of variance which underlie analysis of 
variance procedures. The data of this experiment appear particularly 
amenable to transformation. It is generally agreed that the sensitivity or 
power of tests of significance is adversely affected when the assumptions 
underlying those tests are not satisfied. The matter of appropriate inter- 
pretation of such tests, or of tests on transformed data, is in need of 
further research of both a theoretical and a practical nature (4, 10, 35). 

A rather unique technic was employed by Campbell and Mohr (8) in 
investigating the effect of ordinal position of items in a 16-item checklist 
of radio program types. They used a 16 x 16 Latin square with ordinal 
position of items as the row factor, items as the column factor, and forms 
of the checklist as the Latin square factor. Their criterion was “frequency 
of choice” of each item under instructions to check the five most-liked 
programs (items). With 1280 subjects each checking five items out of 16, 
the expected frequency in each row was 5/16 × 1280, or 400. These experimenters 
applied a χ² test of goodness of fit to the observed row totals, 
achieving what appear to be readily interpretable results. However, the 
χ² test is based on the assumption of independence of the individual 
observations, a condition clearly not satisfied in this application. The 
larger the number of items checked by each subject, the more nearly must 
the observed row totals approach the expected row totals. Any valid test 
of departure of the row totals from expectation must take into account 
the dependence of the variability of the observed row totals on the number 
of choices made by each subject. 
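
The arithmetic behind this objection may be worked out directly. The sketch assumes that, in the absence of any position effect, each subject's checks fall on a random set of distinct positions; under that assumption the number of checks in a given row is a count of subjects, each of whom checks that row with probability k/16. The function name and variable names are introduced only for this illustration.

```python
def row_total_variance(n_subjects, n_items, n_checked):
    """Variance of one row total when each subject checks n_checked
    distinct positions at random out of n_items."""
    p = n_checked / n_items
    return n_subjects * p * (1 - p)


n_subjects, n_items, n_checked = 1280, 16, 5

expected_row_total = n_subjects * n_checked / n_items  # 400.0

# Variance under the sampling that actually took place.
var_actual = row_total_variance(n_subjects, n_items, n_checked)  # 275.0

# Variance implied by treating all 6400 checks as independent observations.
n_checks = n_subjects * n_checked
var_if_independent = n_checks * (1 / n_items) * (1 - 1 / n_items)  # 375.0

# Expected value of the goodness-of-fit statistic in each case.
expected_stat_actual = n_items * var_actual / expected_row_total  # 11.0
expected_stat_independent = n_items * var_if_independent / expected_row_total  # 15.0

# If every subject checked all 16 items, the row totals could not vary at all.
var_all_checked = row_total_variance(n_subjects, n_items, n_items)  # 0.0
```

Under the stated assumption the statistic would average about 11 rather than the 15 expected for 15 degrees of freedom, so that reference to the usual table would understate the significance of departures of the row totals from expectation.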

Strother (46) used a rather complicated design in investigating the role 
of muscle action in interpretative reading. Six different factors were 
involved in the design. Strother recorded only one measure for each cell of 
the multiple-classification table and hence was unable to apply tests of 
significance to any of the interactions. He followed the practice of pooling 
all mean squares except those of the main effects into one residual “error” 
term. This procedure was particularly inappropriate in this experiment 
since some of the crucial comparisons involved between-subject differences 
while others involved only within-subject differences. Separate error 
terms should generally be employed for testing these distinct types of 
comparisons. 

Dressel and Matteson (13) reported a very ingenious experiment con- 
cerned with the effect of client participation in a counseling situation on 
the gain made by that client on a test of “self-understanding.” Forty 
interviews were handled by seven different counselors. Each interview was 
independently rated, by each of four judges, on the extent of client 
participation. Analysis of variance of the rating scores revealed significant 
differences among counselors in the extent of participation elicited. 
Analysis of variance of the gains in self-understanding showed that 
significant differences existed between clients handled by different coun- 
selors. Analysis of co-variance was employed to remove the effects of 
differences in client participation associated with counselors from differ- 
ences in gains in self-understanding among clients handled by the same 
counselor. The resulting analysis showed that participation by the client 
did indeed increase the gain in self-understanding made by the client 
during the interview. Dressel and Matteson used correlational technics 
to further investigate the extent of this relationship. The study provides 
an excellent illustration of the manner in which analysis of variance and 
co-variance may be used to isolate the effects of certain factors and to 
indicate the direction for further research. 


Errors in Applications of Simple t-Tests 


The simple t-test has been subjected to considerable abuse. The require- 
ment of homogeneity of variance was apparently overlooked by Cruick- 
shank (11) in an investigation comparing the problem-solving ability of 
normal and mentally retarded children. The variability of the scores of 
the normal children on the easy problems was quite restricted. When the 
condition of homogeneity is not satisfied, most researchers have turned to 
the Behrens-Fisher test. No applications of the somewhat similar Cochran 
and Cox procedure have come to the attention of these reviewers. Perhaps 
the most common error in the use of the t-technic is the practice of report- 
ing the exact significance level reached by each of a series of t-tests. 
Results presented in this way are likely to be taken by the casual reader as 
indicative of the relative potencies of the various factors or of the relative 
“importance” of the various criteria. Conclusions concerning relative 
potency should be drawn from the obtained means, which are the best 
available estimates of the population means regardless of their standard 
errors. If the means are not expressed in comparable terms, standard 
methods of rendering them comparable should be employed. The t-ratio 
is not a technic for rendering differences comparable. 
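
That the t-ratio reflects the size of the groups as well as the size of the difference may be seen from a brief sketch. The mean difference, standard deviation, and group sizes are hypothetical, and the function `t_ratio` is introduced only for this illustration; it assumes two independent groups of equal size with a common standard deviation.

```python
from math import sqrt


def t_ratio(mean_diff, sd, n_per_group):
    """t for two equal-sized independent groups with a common SD."""
    return mean_diff / (sd * sqrt(2 / n_per_group))


# The same obtained difference of 5 points, with an SD of 10, in two studies.
t_small_groups = t_ratio(5.0, 10.0, 20)
t_large_groups = t_ratio(5.0, 10.0, 200)

print(t_small_groups, t_large_groups)
```

The obtained difference is identical in the two cases, yet the second t is more than three times the first. A ranking of factors by their t-values or significance levels is therefore in part a ranking by sample size.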


Pratt (38) compared the mean mental age and mean readiness scores 
of first-grade pupils who had attended kindergarten with first-graders who 
had not attended kindergarten and with first-graders who were repeating 
the grade. Since the groups were not equal in size, the t-tests were not 
equally precise. Interpretations apparently based on the comparisons of 
the t-values are certainly unjustified in this case. It seems advisable in 
most cases for the experimenter to set a predetermined level of significance 
for all tests, and to simply report categorically which t’s are significant 
according to this criterion. 


Correlation Studies 


Correlational technics continue to find widespread application in edu- 
cational research. The studies cited here have been chosen as illustrative 
of common applications of these technics. 


Riesch (41), in a study of some factors in pupil growth, tested 258 
pupils representing rural and urban school systems. The tests, representing 
five areas of personality and achievement, were administered before and 
after a 12-week period and various sets of intercorrelations were obtained 
between pretest, post-test and gain score for the rural and urban pupils 
separately. The large number of coefficients reported are difficult to 
interpret since the tests undoubtedly overlap considerably in their factorial 
composition. 


Lennon (29), in investigating some problems concerned with the development 
of test norms, drew a random sample of 25 percent of the grade-school 
pupils from each of 70 communities and obtained for each pupil 
a measure of intelligence and a measure of achievement in each of several 
subject-matter areas. He then determined the mean IQ and mean achievement 
score for the sample of pupils in each grade in each community 
and obtained the correlation between these means at each grade level. 
Lennon gave adequate consideration to the possibility that increasing 
variability of the mean achievement test scores from grade to grade 
might account for the observed increase in correlation from the lower 
to the upper grades. Wesley, Corey, and Stewart (49), in studying the 
relationship between interests and ability, apparently did not consider 
the effect on the correlation coefficient of differing test reliabilities and 
differing test variances. 


Ewers (15) made appropriate use of the χ² test of independence to 
determine whether or not any relationship existed between certain auditory 
variables and the scores on two reading tests. If this test indicated a 
relationship, she proceeded to determine the correlation between each test 
and the auditory variable under investigation. 


Shaw (44) used a wide variety of correlation technics in examining the 
relation of scores on the Chicago Primary Mental Abilities Test to high- 
school achievement. The results of this study were unusually well reported. 


Hieronymus (22) made use of analysis of co-variance to obtain partial 
correlation coefficients relating socio-economic status to socio-economic 
expectation, holding IQ constant. He presented an interesting discussion 
relating to the use of within-schools correlations and total correlations in 
studies involving socio-economic factors among pupils from different 
schools. 


Honzik, MacFarlane, and Allen (23) reported an interesting longitudinal 
correlational study. These investigators correlated the scores of a group 
of children on mental tests administered at various times thruout a period 
of 16 years. They paid particular attention to possible biases which are 
likely to arise in such studies due to shifting sample components. 


Carlson (9), in a study of the relationship of speed of reading to 
accuracy of comprehension, treated pupils from eight schools as a single 
group. He justified this combining procedure on the basis of nonsignificant 
differences among the means and variances of the component school 
groups. Carlson might better have obtained the total, between-schools and 
within-schools correlations between speed and accuracy and disregarded 
school differences on the basis of a negligible between-schools correlation, 
rather than on the basis of nonsignificant differences between school 
intelligence measures. 
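
The distinction among total, between-schools, and within-schools correlations may be shown with a deliberately extreme sketch. The speed and accuracy scores for the three schools are hypothetical and are not Carlson's data; the names used are introduced only for this illustration. The between-schools coefficient is computed from the school means, and the within-schools coefficient from deviations of pupils from their own school means.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical (speed, accuracy) scores for pupils in three schools.
schools = {
    "school 1": ([1, 2, 3], [3, 2, 1]),
    "school 2": ([4, 5, 6], [6, 5, 4]),
    "school 3": ([7, 8, 9], [9, 8, 7]),
}

# Total correlation: all pupils treated as a single group.
all_speed = [s for speed, _ in schools.values() for s in speed]
all_acc = [a for _, acc in schools.values() for a in acc]
r_total = pearson_r(all_speed, all_acc)

# Between-schools correlation: computed from the school means.
r_between = pearson_r(
    [mean(speed) for speed, _ in schools.values()],
    [mean(acc) for _, acc in schools.values()],
)

# Within-schools correlation: deviations from each pupil's own school mean.
dev_speed = [s - mean(speed) for speed, _ in schools.values() for s in speed]
dev_acc = [a - mean(acc) for _, acc in schools.values() for a in acc]
r_within = pearson_r(dev_speed, dev_acc)

print(r_total, r_between, r_within)
```

In these made-up data the total correlation is substantially positive even though the relation within every school is perfectly negative, because the school means vary together. Pooling schools is safe only when the between-schools component is negligible.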


Illustrative of the large number of studies making fairly standard use 
of simple, partial, and multiple correlation procedures are those by Berdie 
(7), Greenblatt (20), Furst (18), Krathwohl (28), McClanahan and 
Morgan (32), Resnick (40), and Schultz (43). 



December 1951 EXPERIMENTAL DESIGN AND ANALYSIS 





Bibliography 


1. Anderson, Kenneth E. “A Frontal Attack on the Basic Problem in Evaluation: 
The Achievement of the Objectives of Instruction in Specific Areas.” Journal of 
Experimental Education 18: 163-74; March 1950. 

2. Anderson, Robert H. “The Influence of an In-Service Improvement Program 
upon Teacher Test Behavior and Classroom Program.” Journal of Educational 
Research 44: 205-15; November 1950. 

3. Ash, Philip. “The Relative Effectiveness of Massed Versus Spaced Film 
Presentation.” Journal of Educational Psychology 41: 19-30; January 1950. 

4. Bartlett, Maurice S. “The Use of Transformations.” Biometrics 3: 39-52; 
March 1947. 

5. Beckley, Donald K. “A Scientific Appraisal of Professional Education for 
Business.” Journal of Educational Psychology 40: 174-88; March 1949. 

6. Bentley, Ralph R. “An Experimental Evaluation of the Relative Effectiveness of 
Certain Audio-Visual Aids in Vocational Agriculture.” Journal of Experimental 
Education 17: 373-81; March 1949. 

7. Berdie, Ralph F. “The Differential Aptitude Tests as Predictors in Engineering 
Training.” Journal of Educational Psychology 42: 114-23; February 1951. 

8. Campbell, Donald T., and Mohr, Phillip J. “The Effect of Ordinal Position 
upon Responses to Items in a Check List.” Journal of Applied Psychology 
34: 62-67; February 1950. 

9. Carlson, Thorsten R. “The Relationship Between Speed and Accuracy of 
Comprehension.” Journal of Educational Research 42: 500-12; March 1949. 

10. Cochran, William G. “Some Consequences When the Assumptions for the 
Analysis of Variance Are Not Satisfied.” Biometrics 3: 22-38; March 1947. 

11. Cruickshank, William R. “Arithmetic Ability of Mentally Retarded Children: 
I. Ability To Differentiate Extraneous Materials from Needed Arithmetical 
Facts.” Journal of Educational Research 42: 161-70; November 1948. 

12. Downie, Norville M.; Troyer, Maurice E.; and Pace, C. Robert. “The Knowledge 
of General Education of a Sample of Syracuse University Students As 
Revealed by the Cooperative General Culture Test and the Time Magazine 
Current Affairs Test.” Educational and Psychological Measurement 10: 294-306; 
Summer 1950. 

13. Dressel, Paul L., and Matteson, Ross W. “The Effect of Client Participation 
in Test Interpretation.” Educational and Psychological Measurement 10: 
693-706; Winter 1950. 

14. Edwards, Allen L. “Homogeneity of Variance and the Latin Square Design.” 
Psychological Bulletin 47: 118-29; March 1950. 

15. Ewers, Dorothea W. F. “Relations Between Auditory Abilities and Reading 
Abilities: A Problem in Psychometrics.” Journal of Experimental Education 
18: 239-62; March 1950. 

16. Fitch, Mildred A.; Drucker, Arthur J.; and Norton, J. A., Jr. “Frequent 
Testing As a Motivating Factor in Large Lecture Classes.” Journal of 
Educational Psychology 42: 1-20; January 1951. 

17. Freeburne, Cecil M. “The Influence of Training in Perceptual Span and 
Perceptual Speed upon Reading Ability.” Journal of Educational Psychology 
40: 321-52; October 1949. 

18. Furst, Edward. “Effect of the Organization of Learning Experiences upon the 
Organization of Learning Outcomes: I. Study of the Problem by Means of 
Correlation Analysis.” Journal of Experimental Education 18: 215-28; March 
1950. 

19. Glock, Marvin D. “The Effect upon Eye-Movements and Reading Rate at the 
College Level of Three Methods of Training.” Journal of Educational 
Psychology 40: 93-106; February 1949. 

20. Greenblatt, E. L. “Relationship of Mental Health and Social Status.” Journal of 
Educational Research 44: 193-204; November 1950. 

21. Heidgerken, Loretta E. “An Experimental Study To Measure the Contribution 
of Motion Pictures and Slide-Films to Learning Certain Units in the Course 
Introduction to Nursing Arts.” Journal of Experimental Education 17: 261-93; 
December 1948. 


REVIEW OF EDUCATIONAL RESEARCH Vol. XXI, No. 5 





22. Hieronymus, Albert N. “A Study of Social Class Motivation: Relationships 
Between Anxiety for Education and Certain Socio-Economic and Intellectual 
Variables.” Journal of Educational Psychology 42: 193-205; April 1951. 

23. Honzik, Marjorie P.; MacFarlane, Jean W.; and Allen, Lucille. “The 
Stability of Mental Test Performance Between Two and Eighteen Years.” 
Journal of Experimental Education 17: 309-24; December 1948. 

24. Johnson, Donovan A. “An Experimental Study of the Effectiveness of Films 
and Filmstrips in Teaching Geometry.” Journal of Experimental Education 
17: 363-72; March 1949. 

25. Johnson, Palmer O., and Fay, Leo C. “The Johnson-Neyman Technique, Its 
Theory and Application.” Psychometrika 15: 349-67; December 1950. 

26. Jones, Daisy M. “An Experiment in Adaptation to Individual Differences.” 
Journal of Educational Psychology 39: 257-72; May 1948. 

27. Koenker, Robert H. “Arithmetic Readiness at the Kindergarten Level.” Journal 
of Educational Research 42: 218-23; November 1948. 

28. Krathwohl, William C. “Relative Contributions of Vocabulary and an Index 
of Industriousness for English to Achievement in English.” Journal of 
Educational Psychology 42: 97-104; February 1951. 

29. Lennon, Roger T. “The Relation Between Intelligence and Achievement Test 
Results for a Group of Communities.” Journal of Educational Psychology 
41: 301-308; May 1950. 

30. Lindquist, Everet F. “The Significance of a Difference Between ‘Matched’ 
Groups.” Journal of Educational Psychology 22: 197-204; March 1931. 

31. Lueck, William R. “An Experiment in Writing Algebraic Equations.” Journal 
of Educational Research 42: 132-37; October 1948. 

32. McClanahan, Walter R., and Morgan, David H. “Use of Standard Tests in 
Counseling Engineering Students in College.” Journal of Educational Psychology 
39: 491-501; December 1948. 

33. McGinnis, Dorothy J. “Corrective Reading: A Means of Increasing Scholastic 
Attainment at the College Level.” Journal of Educational Psychology 42: 166-73; 
March 1951. 

34. Mitchell, Adelle H. “The Effect of Radio Programs on Silent Reading 
Achievement of Ninety-One Sixth Grade Students.” Journal of Educational 
Research 42: 460-70; February 1949. 

35. Mueller, Conrad G. “Numerical Transformations in the Analysis of 
Experimental Data.” Psychological Bulletin 46: 198-223; May 1949. 

36. Neidt, Charles O., and Fritz, Martin F. “Relation of Cynicism to Certain 
Student Characteristics.” Educational and Psychological Measurement 10: 
712-17; Winter 1950. 

37. Peters, Charles C., and Van Voorhis, Walter R. Statistical Procedures and 
Their Mathematical Bases. New York: McGraw-Hill Book Co., 1940. 516 p. 

38. Pratt, Willis E. “A Study of the Differences in the Prediction of Reading 
Success of Kindergarten and Non-Kindergarten Children.” Journal of 
Educational Research 42: 525-33; March 1949. 

39. Remmers, Herman H., and Drucker, Arthur J. “Teen-Agers’ Attitudes Toward 
Problems of Child Management.” Journal of Educational Psychology 42: 105-13; 
February 1951. 

40. Resnick, Joseph. “A Study of Some Relationships Between High School Grades 
and Certain Aspects of Adjustment.” Journal of Educational Research 44: 
321-40; January 1951. 

41. Riesch, Kenneth P. “A Study of Some Factors on Pupil Growth.” Journal of 
Experimental Education 18: 31-55; September 1949. 

42. Robinson, Harvey A. “A Note on the Evaluation of College Remedial Reading 
Courses.” Journal of Educational Psychology 41: 83-96; February 1950. 

43. Schultz, Douglas G. “The Relationship Between Scores on the Science Test of 
the Medical College Admission Test and Amount of Training in Biology, 
Chemistry, and Physics.” Educational and Psychological Measurement 11: 
138-50; Spring 1951. 


44. Shaw, Duane C. “A Study of the Relationships Between Thurstone Primary 
Mental Abilities and High School Achievement.” Journal of Educational 
Psychology 40: 239-49; April 1949. 

45. Shimberg, Benjamin. “Information and Attitudes Toward World Affairs.” 
Journal of Educational Psychology 40: 206-22; April 1949. 



46. Strother, George B. “The Role of Muscle Action in Interpretative Reading.” 
Journal of General Psychology 41: 3-20; July 1949. 

47. Tate, Merle W. “Individual Differences in Speed of Response to Mental Test 
Materials of Varying Degrees of Difficulty.” Educational and Psychological 
Measurement 8: 353-74; Summer 1948. 

48. Walter, Lowell M., and Marzolf, Stanley S. “The Relation of Sex, Age and 
School Achievement to Levels of Aspiration.” Journal of Educational 
Psychology 42: 285-92; May 1951. 

49. Wesley, S. M.; Corey, Douglas Q.; and Stewart, Barbara M. “The 
Intra-Individual Relationship Between Interest and Ability.” Journal of Applied 
Psychology 34: 193-97; June 1950. 

50. Wilks, Samuel S. “The Standard Error of the Means of ‘Matched’ Samples.” 
Journal of Educational Psychology 22: 205-208; March 1931. 



CHAPTER IV 


Factor Analysis in Educational Research 


JOHN B. CARROLL and ROBERT F. SCHWEIKER 


Shorter treatments of factor methods and results have appeared passim 
in previous issues of the REVIEW in the past 12 years. In the cycle dealing 
with research methods in education, the following articles deserve mention: 
Vol. XII, December 1942, Chapter VII (Fattu); Vol. XV, December 
1945, Chapter VII (Lorge); Vol. XVIII, December 1948, Chapter VIII 
(Fattu). Much greater attention has been paid to factor analysis in the 
cycle dealing with educational and psychological tests; the following 
articles may be consulted: Vol. XI, February 1941, Chapters II (Stuit), 
V (Traxler), and VIII (Flanagan); Vol. XIV, February 1944, Chapters 
IV (Sells), V (Traxler), and X (Conrad); Vol. XVII, February 1947, 
Chapters II (Findley, Turnbull, and Conrad), III (Carter), IV (Ellis), 
and VIII (Travers); and Vol. XX, February 1950, Chapters I (Segel and 
Gerberich), II (Cornell and Gillette), III (Stuit), IV (Traxler and 
Jacobs), and IX (Ebel). 

Since this is the first full-scale treatment of factor analysis which has 
appeared in the Review since the December 1939 issue, it seems wise to
consider as a whole the work that has been done since that time, with 
particular attention to more recent developments. Our purpose is to present 
a basis for the evaluation of factor analysis as a research technic, with 
proper concern for results and applications. 


General Reviews and Bibliographies 


Wolfle’s (169) comprehensive and easily understandable survey of the 
whole field of factor analysis continues to be the best starting point for 
the student; Vernon’s book (160), concerned largely with the factor 
analysis of ability tests, is a valuable and up-to-date supplement. 

There is no comprehensive and up-to-date bibliography devoted to 
factor analysis research of all kinds. Swineford and Holzinger (144) 
continued their series of selected references; bibliographical aids by 
Goheen and Kavruck (61) and Buros (12) may also be consulted. 


Methods of Factor Analysis 


Factor analysis can be crudely described as an extension of the corre- 
lational method. When several variables are found to be rather highly 
correlated, it may be inferred that they are connected in some way, perhaps 
by a common underlying variable which is not immediately present in 
the measurements but which nevertheless would account for them to a 
major extent. On the other hand, variables showing no relationship are
assumed to be measuring different things. When an investigator is faced 
with a fairly large number of variables it becomes difficult to disentangle 
the relationships inherent in them. It is the purpose of factor methods 
to enable the investigator to find relatively simple ways of describing and 
accounting for the relationships between variables. 

In order to do this, all present methods of factor analysis assume a 
general mathematical model represented by the equation 


$$s_{ji} = a_{j1} x_{1i} + a_{j2} x_{2i} + \cdots + a_{jm} x_{mi},$$


which states that the standard score (s) of person i on variable j is the 
sum of his standard scores (x) on factors 1 to m, weighted by the co- 
efficients a which pertain to the loadings of the variable on each of the 
factors 1 to m. (In some applications of factor analysis, persons and 
variables are transposed, or “person” could be replaced by some other 
class of thing.) Whether or not this model has any ultimate justification, 
it seems to be convenient and useful, altho Jeffress (97) criticized the 
assumption of invariant factor loadings thruout different persons. The 
various factor methods differ fundamentally only in the criteria which 
they set for determining factors and factor weights. In most cases the 
analysis starts from an empirical correlation matrix or a co-variance 
matrix, and ends with some type of factor matrix F, such that the matrix 
product FF’ will equal the original matrix, with or without adjustments. 
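As an illustration only, the model can be written out in a few lines of Python. The loadings below are hypothetical numbers chosen for the example, not values from any study cited here, and the function names are likewise assumptions. The sketch shows how the common-factor part of a standard score is built up from factor scores, and how the product FF' reproduces the correlations off the diagonal and the communalities on the diagonal.

```python
# Hypothetical loadings of four variables on two uncorrelated common factors.
# The numbers are made up for illustration; they come from no cited study.
F = [
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.6],
    [0.5, 0.5],
]


def common_part_of_score(loadings_j, factor_scores_i):
    """Common-factor part of s_ji = a_j1*x_1i + ... + a_jm*x_mi."""
    return sum(a * x for a, x in zip(loadings_j, factor_scores_i))


def reproduced_matrix(factor_matrix):
    """Matrix product F F': entry (j, k) is the sum over factors of a_jp * a_kp."""
    return [
        [sum(a * b for a, b in zip(row_j, row_k)) for row_k in factor_matrix]
        for row_j in factor_matrix
    ]


R_hat = reproduced_matrix(F)
communalities = [R_hat[j][j] for j in range(len(F))]

# A person one unit above the mean on factor 1 and two units below on factor 2.
score_on_variable_4 = common_part_of_score(F[3], [1.0, -2.0])
```

In this sketch the first two variables share only the first factor, so their reproduced correlation is the product of their loadings on it, while the first and third variables share no factor and are reproduced as uncorrelated.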


Principal Component Methods 


The methods developed by Hotelling (90) and Kelley (104) still com- 
mand the greatest respect from the standpoint of mathematical elegance. 
These methods are concerned with finding a unique set of independent 
factors, whose number equals the number of variables, which will com- 
pletely account for the variances and co-variances in a body of data. 
The proportion of total variance accounted for by each principal com- 
ponent is computed, and it is possible to compute the statistical significance 
of the results. Lawley (110) and Emmett (48) discussed a variant of the 
method using maximum likelihood to estimate the common factor 
variances. Principal component methods, when used to analyze only the 
common factor variance of a multivariate series, give results which can 
be used as the starting point for other factor methods, as Holzinger and 
Harman (87) showed. The methods are seldom used today, partly 
because of their laboriousness (seldom have more than a dozen variables 
been studied at a time) and partly because of the fact that the results are 
difficult to interpret directly. They have the advantage that there is no 
room for the exercise of subjective judgment, unless one desires to weight 
the initial variances, as David (42) did in his study of reading tests. 
Tucker (155) and Richardson (128) proposed simplified principal component
computations.
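A minimal sketch of the idea follows, assuming a small hypothetical correlation matrix with unit diagonal. It finds the first principal component by repeated multiplication (power iteration), which is in the spirit of the iterative computations of the period but is not offered as a reproduction of any cited author's procedure. The matrix values and function names are assumptions made for the example.

```python
def mat_vec(R, v):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in R]


def first_component(R, iterations=200):
    """Power iteration: largest latent root of R and its unit-length vector."""
    v = [1.0] * len(R)
    root = 0.0
    for _ in range(iterations):
        w = mat_vec(R, v)
        root = sum(x * x for x in w) ** 0.5   # length of R v tends to the largest root
        v = [x / root for x in w]
    return root, v


# Hypothetical correlations among three tests.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

root, vector = first_component(R)
loadings = [root ** 0.5 * x for x in vector]   # loadings on the first component
proportion = root / len(R)                     # share of total variance accounted for
```

Because the diagonal entries are unity, the total variance equals the number of variables, so the proportion accounted for by a component is its latent root divided by that number.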


The Two-Factor Theory


The two-factor theory, most recently defended by Spearman (136) and 
Spearman and Jones (137), postulates that the data can most usefully be 
factored by assuming that there is only one general factor (G), plus a 
specific factor associated with each variable alone. It is now apparent, 
however, that the two-factor theory was never intended to rule out the 
possibility of group factors over and above the general factor; instead, 
it was offered as a basis for studying the nature of the general factor 
when a correlation matrix was found to have hierarchical properties. 
Indeed, Spearman and Jones speak of three general factors: G, p (perseveration),
and O (oscillation), together with a number of group factors
such as verbal ability, mechanical ability, etc. Properly speaking, the two- 
factor theory is not so much a factor method as a theoretical framework 
for the interpretation of results. 


Holzinger’s Group-Factor Methods 


The methods espoused by Holzinger and Harman in their book (87), 
and in other papers by Holzinger (83, 84), assume that in the usual case
of mental ability tests there is a general factor common to all the tests, 
a relatively small number of independent but sometimes overlapping group 
factors associated with small groups of tests, and specific factors for each 
of the tests. The communality (i.e., common factor variance) of a test is 
that part of its variance which is accounted for by general and group factors.
This “bi-factor” theory immediately suggests a procedure for factoring: 
estimate the communalities of the tests and extract the general factor; 
from the residuals extract group factors for clusters of highly correlated 
tests. The method of B-coefficients is used to determine the clusters of tests 
on which the group factors are to be based, this technic being somewhat 
similar in purpose to Tryon’s (154) method of cluster-analysis. The result 
is a bi-factor matrix showing the weights of each test on each factor. 
Holzinger and Harman also present methods for the “uni-factor” case, 
where there is no general factor, and for nonorthogonal solutions in which 
the factors are not independent. They carefully distinguish between the 
pattern matrix (loadings of the tests on the factors) and the structure 
matrix (correlations of the tests with the factor scores); these matrices are
identical only when factors are uncorrelated. The bi-factor solution is 
relatively easy to compute and allows relatively little scope for subjective 
judgment; the statistical significance of some results can be assessed. 
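The distinction between pattern and structure can be illustrated with a short sketch. It assumes the usual relation that the structure matrix is the pattern matrix multiplied by the matrix of correlations among the factors; the loadings and the factor correlation of .5 are hypothetical values chosen for the example.

```python
def mat_mul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Hypothetical pattern matrix: loadings of three tests on two factors.
pattern = [
    [0.7, 0.0],
    [0.6, 0.2],
    [0.0, 0.8],
]

phi_uncorrelated = [[1.0, 0.0], [0.0, 1.0]]   # independent factors
phi_correlated = [[1.0, 0.5], [0.5, 1.0]]     # factors correlating .5

structure_uncorrelated = mat_mul(pattern, phi_uncorrelated)
structure_correlated = mat_mul(pattern, phi_correlated)
```

With independent factors the structure matrix simply repeats the pattern matrix. With correlated factors the first test, which has a zero loading on the second factor, nevertheless correlates .35 with it.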


Multiple-Factor Methods 


In 1947 Thurstone (149) published a “development and expansion” of 
his earlier book which provides a full exposition of his widely used 
multiple-factor methods. Thurstone claimed to develop a method of the
utmost generality by not being concerned about distinctions between 
general and group factors. Instead, the number of “primary” factors 
accounting for the common factor variance in a set of data is determined, 
in theory, as a function of the rank of the correlation matrix; the emphasis 
is on reducing the variables to the smallest possible number of common 
factors. There are two chief methods for reducing the correlation matrix 
to an orthogonal factor matrix: the well-known centroid method, and the 
newer multiple-group method (149, 150). This factor matrix, however, 
being one of an infinity of possible solutions, is regarded simply as the 
starting point for the rotation of the axes so as to get a more meaningful 
factor matrix according to the principle of “simple structure.” Simple 
structure is obtained when a maximum number of test vectors lie on or 
near the rotated coordinate hyperplanes of the common factor space; it 
can be mathematically defined but in practice one is guided to simple 
structure by using graphical representations of the factor loadings at any 
given stage of the procedure. Each factor in the final matrix has nonzero 
loadings, usually, on only a few variables; these variables can often be 
identified, in fact, in the original correlation matrix as constituting a 
highly correlated cluster. Thurstone believes that the use of the principle 
of simple structure leads to invariance of the solution, and hence, other 
things being equal, to invariant results as to the factorial composition of 
tests included in different factor studies. The attainment of simple structure 
often requires that the primary factors be nonorthogonal, i.e., correlated. 
Thurstone introduces second-order factors to account for the correlations 
between primary factors; in some cases, a second-order factor may have 
the appearance of a general factor. Thurstone’s methods have been re- 
garded as moderately laborious, but the multiple-group factoring method 
and certain newer rotational procedures (74, 77, 157) are economical
improvements over early versions of the technic. Correlation matrices 
with as many as 75 variables have been studied. Simplified multiple-factor 
methods were proposed by Carlson (25), Adcock (2), and Wherry and 
Gaylord (165), altho these methods depart from Thurstone’s premises 
in some respects. The study by Wherry, Campbell, and Perloff (163) 
used the Wherry-Gaylord iterative procedure. 
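The statement that an orthogonal factor matrix is only one of an infinity of solutions can be checked numerically. The sketch below assumes a hypothetical unrotated matrix with a general factor and a bipolar factor, and applies one orthogonal transformation to it. The numbers are invented for the illustration, and the transformation is chosen by inspection rather than by any of the graphical or analytical procedures cited above.

```python
import math


def mat_mul(A, B):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Rows become columns."""
    return [list(col) for col in zip(*A)]


# Hypothetical unrotated matrix: a general factor plus a bipolar factor.
F = [
    [0.6, 0.6],
    [0.5, 0.5],
    [0.6, -0.6],
    [0.5, -0.5],
]

# An orthogonal transformation (T T' = I): a 45-degree rotation combined
# with a reflection of the second axis.
c = 1.0 / math.sqrt(2.0)
T = [
    [c, c],
    [c, -c],
]

G = mat_mul(F, T)                    # transformed factor matrix
before = mat_mul(F, transpose(F))    # F F'
after = mat_mul(G, transpose(G))     # G G'
```

In this example the transformed matrix G has each test loaded on one factor only, which is the kind of configuration simple structure seeks, while the products FF' and GG' agree, so both matrices account for the correlations equally well.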


Burt’s Group-Factor Methods 


Burt (13, 18, 19, 22, 23) prefers to determine a final factor matrix 
which aside from some differences in computation is similar to a Thur- 
stone centroid factor matrix. This factor matrix is interpreted as contain- 
ing, first, a general factor, and then a relatively small number of bi-polar 
group factors, with roughly equal numbers of positive and negative load- 
ings on each, which reveal contrasting groups of variables. Computations 
are relatively easy, and the procedure is fairly straightforward. Burt 
believes that the method allows the interpretation of results in terms of a
hierarchy of factors, from very broad factors to very narrow factors. 


Eclectic Factor Methods 


It is by now an oversimplification to speak of sharply defined factor 
methods. Each of the major factor theorists mentioned above has borrowed 
technics from the others, or has proposed alternative methods for different 
types of situations. While Holzinger and Harman (87) prefer a group 
factor method, they show the relations between their method and those of 
Hotelling and Thurstone. Thomson’s position (145) is somewhere between 
those of Holzinger and Thurstone; he has assumed the role of chief critic 
and devil’s advocate in factor analysis (146). Eysenck (51), Cattell (32) 
and Vernon (160) are also eclectic in their methods; Vernon, for 
example, stated that in Great Britain it is now customary to perform a 
group-factor analysis (usually of the Burt variety), but guided by the 
results of a preliminary Thurstone centroid analysis. 


Latent Structure Analysis 


Lazarsfeld (113) recently proposed methods of multivariate analysis 
which have much the same purpose as factor analysis but which reject 
the mathematical model described above and completely avoid the use of 
correlation coefficients. The methods are as yet incompletely developed 
and too recent to allow any evaluation of them. They might be promising 
if computational procedures can be drastically simplified to permit study 
of large matrices. 


The Data Subjected to Factor Analysis 


The data ordinarily subjected to factor analysis consist of the corre- 
lations between a set of variables measured on a sample of persons. These 
variables usually consist of continuous measurements (test scores, ratings, 
physical measurements, etc.). However, sex is sometimes included as a 
variable and Burt (17) suggested that a whole series of qualitative 
variables, like eye-color, can be studied. It is also possible to study 
correlations based on samples of things which are not persons; Davenport 
and Remmers (39) studied correlations between mean test scores and 
other variables measured on states of the U. S., and Coombs and Satter 
(37) studied common-element correlations between occupations. In all 
cases, however, caution must be exercised to see that application of factor 
analysis realistically conforms to the mathematical model it assumes. 

Many studies are based on correlations between persons, either over a 
series of traits (Q-technic) or occasions (P-technic). The theory and 
methods of these technics (variously called inverted or obverse factor 
analysis) have been discussed in numerous books and articles (11, 18, 32, 
121, 138, 139, 145). Sandler (131) followed Burt in recommending that
there should be a reciprocity between results obtained by correlations 
of persons and by correlations of tests. The major difficulty in the Q- 
technic has to do with rationalizing a common metric for a series of tests 
applied to a person. 
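The difficulty about a common metric can be made concrete with a small sketch. The scores below are hypothetical, and the example assumes only that a correlation between two persons is computed across their scores on a series of traits. Changing the unit of one trait leaves the persons unchanged but alters their correlation.

```python
import math


def correlation(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical scores of two persons on three traits.
person_a = [1.0, 2.0, 3.0]
person_b = [2.0, 1.0, 3.0]
r_original_units = correlation(person_a, person_b)

# The same persons after the first trait is expressed in a unit ten times finer.
person_a_rescaled = [10.0, 2.0, 3.0]
person_b_rescaled = [20.0, 1.0, 3.0]
r_rescaled_units = correlation(person_a_rescaled, person_b_rescaled)
```

Here the correlation between the two persons rises from .50 to nearly unity merely because one trait was rescaled, which is why some defensible standardization of the traits is needed before persons are correlated.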

Any influences affecting correlation coefficients will obviously affect 
factor results. Incidentally, the correlation coefficients must be accurately 
computed; errors are all too frequent. Thurstone (149) and Thomson 
(145) discussed the effects of selection by univariate or multivariate 
restriction. Correlations based on samples of a wide range of ability will 
tend to be higher than otherwise, and if oblique rotations are used the 
primary factors may tend to be more highly correlated. Likewise, as 
Guilford and Michael (68) pointed out, the reliabilities and lengths of the 
tests will affect factor loadings. Theoretically, selection and unreliability 
do not influence the simple structure which is found or the factor com- 
positions of the tests; however, there may be spurious effects arising from 
the use, in the same correlation matrix, of correlations based on different 
samples, and from the use of samples containing groups contrasting widely 
in background or experience. 

One thing that can markedly disturb factor results, even to the extent 
of yielding spurious factors, is the use of correlations (either product- 
moment or tetrachoric) which are systematically affected by disparate 
distributions of the variables and/or systematic errors arising from guess- 
ing by the subjects. These effects become particularly acute in the factor 
analysis of test items which differ in difficulty. Guilford (62) seemed 
to have isolated difficulty factors in the Seashore Sense of Pitch test, 
but altho Wherry and Gaylord (166) think Guilford’s interpretations are 
valid because he used tetrachoric correlations, Carroll (26) showed that 
even tetrachoric correlations are systematically in error when subjects 
guess, and gave procedures for correcting correlations for chance success. 
Other discussions of this problem are those of Ferguson (53) and Lawley 
(111). The important point is that even if a set of items or tests all measure 
the same factor, their intercorrelation matrix cannot have a rank of unity 
unless allowance is made for difficulty and chance-success effects. One 
investigator (27) transformed all nonnormal distributions to approximate 
normality in order to avoid the effects of disparate distributions on corre- 
lations; other investigators used tetrachoric correlation thruout, for a 
similar reason. 
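The effect of disparate distributions on product-moment correlations between items can be sketched as follows. The example assumes three hypothetical pass-fail items that measure the same ability perfectly, so that everyone who passes a harder item also passes every easier one. It treats only differences in difficulty; the guessing effect discussed by Carroll is not modeled.

```python
import math


def phi(p_a, p_b, p_both):
    """Product-moment correlation between two pass/fail items."""
    return (p_both - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))


def phi_perfect_scale(p_easy, p_hard):
    """Phi when everyone who passes the harder item also passes the easier one."""
    return phi(p_easy, p_hard, p_both=p_hard)


# Hypothetical proportions passing three items that measure the same ability.
easy, medium, hard = 0.8, 0.5, 0.2

r_easy_medium = phi_perfect_scale(easy, medium)
r_medium_hard = phi_perfect_scale(medium, hard)
r_easy_hard = phi_perfect_scale(easy, hard)
```

Although the items differ in nothing but difficulty, the correlations are .50, .50, and .25, and the pair farthest apart in difficulty correlates least. An analysis of such coefficients could therefore suggest factors that reflect difficulty rather than content.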

The composition of a battery of tests subjected to factor analysis and 
the factorial purity of the tests will have an effect on the results, as 
Thurstone (148) pointed out some years ago. A common factor cannot be 
found unless it is represented in at least two, and preferably three or 
more, of the variables. This fact alone explains many of the apparent 
discrepancies between different analyses. Furthermore, a broad general 
factor or a strong second-order factor will not be found unless the tests 
are all of a fairly similar character, e.g., all in the realm of cognitive or 
intellectual tests. On the other hand, the inclusion of tests which are too
highly similar will convert what should be specific variance into common 
factor variance; likewise, the inclusion of variables which are experi- 
mentally dependent or correlated by part-whole relationships should be 
avoided. Davidson and Carroll’s study (40) exemplifies a technic for 
handling such problems, altho there was some unavoidable experimental 
dependence in this study between measures of speed and level of ability 
taken from the same tests. The study does illustrate the point, however,
that batteries involving large numbers of time-limit scores cannot yield 
pure and unequivocal factors because time-limit scores are themselves 
often complex, being composed of “speed” and “level” components. No 
large-scale factor investigation thus far has been designed to avoid this 
difficulty. Perhaps factor analysis cannot give clear results until more work 
has been done to make the tests themselves more homogeneous, as Loe- 
vinger (114) suggested. 


Reduction of the Correlation Matrix 


The first, and sometimes the final, step in factor analysis is the reduction 
or transformation of the correlation matrix into a factor matrix. Guttman 
(71) presents the general theory of this reduction. This procedure is 
related to other statistical technics, such as multiple correlation (47, 72), 
canonical correlation (16), and analysis of variance (14). A number of 
methods for estimating communalities are discussed in the major works on 
factor analysis (87, 145, 149). Rosner (130) showed that a unique 
(tho laborious) solution exists, and Wherry (162) presented a useful 
iterative method. 
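A rough sketch of one such reduction is given below. It assumes a hypothetical three-test correlation matrix, takes as the communality estimate the largest correlation in each row (one of the simpler estimates in use, and not necessarily the one preferred by any author cited here), and then extracts a first centroid factor as each column sum divided by the square root of the grand total. Later factors, which require sign reflection of the residuals, are not attempted.

```python
# Hypothetical correlations among three tests.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
n = len(R)

# Rough communality estimate: the largest off-diagonal correlation in each row.
estimates = [max(abs(R[j][k]) for k in range(n) if k != j) for j in range(n)]

# Reduced matrix: communality estimates replace the unit diagonal.
reduced = [[estimates[j] if j == k else R[j][k] for k in range(n)] for j in range(n)]

# First centroid factor: each column sum divided by the square root of the grand total.
column_sums = [sum(reduced[j][k] for j in range(n)) for k in range(n)]
grand_total = sum(column_sums)
loadings = [s / grand_total ** 0.5 for s in column_sums]

# First-factor residuals; later factors would be taken from these after sign reflection.
residuals = [[reduced[j][k] - loadings[j] * loadings[k] for k in range(n)] for j in range(n)]
```

The residual columns sum to zero after the first centroid factor is removed, which is why the signs of some variables must be reflected before a second factor can be extracted in the same way.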

On the problem of determining the minimum number of factors, Vernon 
(160) regards methods used by the Thurstone school (36) as too lax,
and British methods as too conservative. He views McNemar’s (117) 
criterion as most useful. Thurstone and his students believe it is wise to 
get slightly too many factors, allowing unused and uninterpretable dimen- 
sions to go into “error space” in the rotations; this makes for easier 
rotation and sharper definition of those factors which prove to be inter- 
pretable, because there is error variance in even the first two or three 
centroid factors. 

Factor analysis has a peculiar status among statistical methods because 
it seldom uses the principles of statistical inference, and the sampling 
distributions of values computed at various stages of analysis are largely 
unknown. Because of the small but real subjective elements in most factor 
procedures it is perhaps futile to look for sampling distributions. On the 
practical side, the invariance of results can be assessed only by the 
intelligent comparison of different studies conducted by similar methods. 
On the theoretical side, tests of significance and sampling problems have 
been discussed by Burt and Banks (24), Saunders (132), Bartlett (8), 
and Henrysson (81). Empirical studies on various problems of sampling 
and factor invariance have been reported by Mosier (122), McNemar
(118), Ferguson and Lawrence (54), and Fiske (56), but much more 
work of this kind needs to be done. 


Transformation of Factor Matrices 


Those of the Spearman-Holzinger-Burt frame of mind believe that the 
initial factor matrix is satisfactory as a basis for interpretation of the 
factors, because their methods are designed to produce such a matrix. 
The Thurstone school, however, is in favor of transforming the matrix 
to conform to simple structure. The argument can generally be resolved 
by noting that the initial factor matrices of the different methods are of 
different sorts. Holzinger’s initial factor matrix is roughly similar to 
Thurstone’s final factor matrix. Only in the case of Burt does the issue 
become really controversial; Burt makes interpretation directly from a 
matrix generally similar to a centroid matrix, while Thurstone and his 
students believe that each factor of a centroid matrix is a rather fortuitous 
combination of the simple structure factors. 

Burt (15) and Spearman and Jones (137) defend the practice of 
extracting a general factor initially; Thurstone (149) advocates the use 
of oblique rotations and the determination, from primary factor corre- 
lations, of second-order factors, which may sometimes be interpreted as 
general factors. He defends the notion of correlated factors, pointing out 
that height and weight are separately useful measurements even tho 
correlated. Holzinger (83, 85) shows that the oblique solution is related 
to an orthogonal solution with one additional (general) factor, but doubts
that successive determination of higher-order factors yields anything of 
new psychological significance. On the other hand, Tucker (156) and, 
from a different point of view, Vernon (160) think that factors can be 
classified hierarchically. The use of “super-factors” in studying correlations
between persons was discussed by Hsü (93).
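How a second-order factor may arise from correlated primary factors can be shown in the simplest case of three primaries. The sketch assumes hypothetical correlations among the primaries and uses the familiar single-factor solution for three variables, in which each loading is found from the three correlations; it is an illustration, not the computing routine of any author cited.

```python
import math


def single_factor_loadings(r12, r13, r23):
    """Loadings of three variables on one common factor, from their correlations."""
    a1 = math.sqrt(r12 * r13 / r23)
    a2 = math.sqrt(r12 * r23 / r13)
    a3 = math.sqrt(r13 * r23 / r12)
    return a1, a2, a3


# Hypothetical correlations among three oblique primary factors.
r12, r13, r23 = 0.48, 0.40, 0.30
a1, a2, a3 = single_factor_loadings(r12, r13, r23)
```

The loadings of .8, .6, and .5 reproduce the three correlations as products of pairs, so in this example a single second-order factor, which might be read as a general factor, accounts fully for the obliqueness of the primaries.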

According to Thurstone, full satisfaction of the criteria of simple 
structure can be made only if one allows oblique rotations as occasion 
demands. Nevertheless, Guilford and Lacey (66) used only orthogonal 
rotations in their extensive series of factor studies in the AAF. But if 
two well-defined clusters are obviously correlated, an orthogonal rotation 
possibly leaves too much to subjective judgment: which cluster should lie 
on a coordinate hyperplane? The results of forced orthogonality, Thur- 
stone and his students believe, are not likely to be sufficiently invariant. 
There is need for research on this controversial point. Many writers 
(30, 31, 74, 77, 89, 149, 157) have suggested various procedures for
making rotations simply and directly. The concept of simple structure was 
criticized, not too convincingly, by Reyburn and others (125, 126). 
Reiersøl (124) gave a theoretical discussion of the identifiability of
factors in simple structure, and Harris (75) presented some mathematical 
developments useful in understanding the rotational problem. Eysenck 
(50) and Lubin (116) proposed rotations whereby the first centroid
factor is to be rotated to a position of maximum correlation with an 
external criterion variable. 


Factor Scores 


Most factor methods allow the investigator to compute the estimated 
scores of persons on each factor; full discussions are given in standard 
works (87, 145, 149). Thomson (147) gave new procedures for estimating 
oblique factor scores, and Guilford and Michael (67) presented simple 
methods for getting “univocal” factor scores. Davis (43) treated the case 
of scores on principal-axis factors. 
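One common way of estimating factor scores is by regression: the weights applied to the standard test scores are found by solving the equations whose coefficients are the test intercorrelations and whose right-hand side is the correlations of the tests with the factor. The sketch below assumes a hypothetical single-factor case with three tests; the loadings, the helper names, and the person's scores are all invented for the illustration.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


# Hypothetical loadings of three tests on a single common factor.
a = [0.8, 0.6, 0.5]

# Test intercorrelations implied by those loadings, with unit diagonal.
R = [[1.0 if j == k else a[j] * a[k] for k in range(3)] for j in range(3)]

weights = solve(R, a)                                   # regression weights
multiple_r_squared = sum(x * w for x, w in zip(a, weights))

# Estimated factor score for one hypothetical person's standard scores.
z = [1.0, 0.5, -0.5]
estimated_score = sum(s * w for s, w in zip(z, weights))
```

The squared multiple correlation, about .73 here, indicates how well the factor can be estimated from these three tests; it falls short of unity because each test carries specific and error variance.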

The hope of some is that by careful test construction, it will be possible 
to find “pure” tests, the scores on which will themselves constitute factor 
scores. Presently available batteries of ability factor tests are somewhat
disappointing in this respect, and more work needs to be done. 


Factor Interpretation: Statistical Aspects 


The psychological (or sociological, or educational, etc., depending on 
the nature of the data) interpretation of a factor is an operation which 
is over and above the statistical identification of the factor. Nevertheless, 
the choice of factor method has an effect on the data used for interpre- 
tation. Factors found by the Holzinger bi-factor method or the Thurstone 
simple structure technic can be interpreted by comparing variables having 
high loadings with variables having loadings near zero. Halstead (73) 
seems to have violated this procedure. He attempted to interpret some of 
his factors by selecting and describing one test having a high loading, 
with little attention to the loadings of other variables. 

The direct interpretation of centroid factors, as practiced by some early 
investigators attempting to use Thurstone methods, or by Burt (18), 
becomes more complicated but is still manageable, according to British 
writers such as Howie (91) and Vernon (160). Such a procedure was 
criticized by Thurstone (148). The direct psychological interpretation of 
principal factors is still more complicated. Multiple-factor theorists like 
Thurstone would say it is virtually impossible; they prefer to make 
interpretations from a factor matrix in which each factor, and each 
variable, has few nonzero loadings. 


Comparative Factor Analysis 


Because of the apparent discrepancies between the results of different 
factor studies, the impression seems to have arisen in some quarters that 
factor analysis is a hopeless mathematical game which cannot add much 
to our knowledge. Such an impression may have been partially justified
a decade ago when factor methods were insufficiently developed or too 
often incorrectly applied, but it is without foundation today, since there 
is wide agreement that different factor methods are interconvertible ways 
of looking at a body of data. Just as a cube seen from different perspec- 
tives will manifest differing apparent shapes, so the various factor methods 
reveal differing apparent structures. Many writers (76, 84, 85, 86, 87, 145, 
149) have presented the mathematical relationships between bi-factor, 
multiple-factor, and principal-component methods. 

An intelligent examination of various factor studies, with full realization 
of the relationships between the methods, shows the results to be in 
substantial agreement. There is need, however, for reanalyses of many 
early studies by more modern methods; among notable reanalyses of this 
type are those by Blakey (9) and Yela (171). There are numerous 
empirical comparisons of different methods (e.g., 1, 7, 13, 127, 142, 161). 
The interchange between Davis (41, 42) and Thurstone (151) is of special 
interest for the comparison of principal-axis and multiple-factor methods. 
Some comparative studies must be viewed with caution. For example,
Ferguson’s (52) comparison of Thurstone and Holzinger methods does
not employ simple structure to best advantage in applying Thurstone’s
method; he could have used oblique rotations and derived a second-order
factor.


Factor Analysis Results and Applications 


Psychological Theory 


As a research technic in psychology and education, factor analysis 
is an extension of the correlation method in that it seeks to find “what 
goes with what,” but at the same time it usually seeks to reduce the 
number of variables needed to describe phenomena. Factor analysis 
studies have yielded a relatively small number of dimensions for describ- 
ing individual differences in ability, temperament, physique, etc. Beyond 
the mere description of these individual differences, many factor analysts 
(20, 21, 137, 145) have desired to relate the findings to psychological 
theory, or even to make inferences about the “structure of the mind.” 
Thurstone (152) emphasized that this can be done only by making careful 
hypotheses as to the nature of the fundamental traits and the kinds of 
performance in which they function, after which the hypotheses should be 
tested and retested in specially designed factor studies. There still remains 
to be written a rigorous treatment and critique of the processes by which 
factors can be “named” and related to psychological theory. It may be 
suggested here that the making of hypotheses about factors should be 
guided by careful attention to the stimulus-response relationships in various 
categories of behavior, and to the kinds of acquired skills and knowledges 
which are likely to go together in an individual’s experience. When this
is done, factor analysis results can be legitimately compared with other 
types of results, in a way that will suggest new avenues of research. Jones 
and Jones (99), for example, showed that factors found in visibility data 
check with experimental results. There is need for studies, using not 
factorial but strictly experimental technics, to investigate the nature and 
conditions of factors. For example, now that the verbal factor V has been 
fairly well delineated by factor studies, we need to know what kinds of 
training or experience can change scores on it, to what extent it may be 
determined genetically, etc. The work of Johnson and others (98) on the 
verbal fluency factor is an example of the kind of research needed. 
Havighurst and Breeze (79) showed relations between factor scores and 
socio-economic variables. 

Kelley (106) and Davis (43) emphasized the need for evaluating factors 
in terms of their importance or social utility. Other writers, however, 
seem to feel that the immediately urgent goal is the scientific determination 
of basic variables, whatever their ultimate utility may be. 


Age Changes in Factors 


Space limitations preclude a careful review of studies dealing with 
factorial results at different ages (e.g., 6, 34, 38, 44, 100, 140). The 
interpretation of such studies is of crucial importance for educational 
practice, but this is difficult because not all studies have been performed 
by identical methods. There are two major questions on age changes in 
mental organization: (a) are there the same number of linearly inde- 
pendent factors at different ages, or do factors emerge at different ages? 
and (b) do the relationships between the factors change with age? The 
former question concerns the rank of the correlation matrix; the latter 
concerns the extent to which factors are correlated or the role of a general 
factor, depending on the method of analysis. After factors are identified, 
one can also be concerned with the growth curves of factors (34) and 
other matters pertaining to their stability. Swineford’s studies (140, 141) 
are among the very few conducted longitudinally. Studies of age changes 
should pay more attention to the effects of selection and the range of talent. 


Factor Analyses of Individual Differences in Ability 


Results in the factor analysis of abilities command serious attention 
from educators, since at least some of these abilities play an important 
role in educational achievement and occupational success. By this time 
there are perhaps 1000 published factor studies devoted to the traits of 
ability; these have, in general, been so well reviewed by Wolfle (169), 
Cattell (29), Vernon (160), and French (58) that we need only give a 
partial list of the varieties of factors which have been fairly well established:


1. A general factor of cognitive ability
2. The verbal factor, knowledge of native language
3. Verbal fluency factors: flow of verbal response, associational facility, etc.
4. Reasoning factors, possibly subdivided into inductive and deductive abilities
5. Memory factors of various sorts, chiefly a rote memory factor
6. Spatial perception and visualization factors
7. Perceptual speed
8. Number ability
9. A general mental speed factor
10. Factors in motor, manual, and athletic performances
11. Numerous factors of sensory capacity, visual acuity, etc.


A general criticism of the studies accomplished thus far is that they have 
been largely restricted to investigating the kinds of tasks conventionally 
found in paper-and-pencil tests and standard aptitude tests. We know little 
about the factors which function in creative writing, public speaking, 
leadership, administrative judgment, or many other behaviors of everyday 
importance. 


Factor Analysis of Educational Skills and Achievements 


Furst (59) used factor analysis in testing the achievement of educa- 
tional objectives in a progressive curriculum. Woodrow (170) reviewed 
factor studies of improvement with training and concluded that learning 
ability is independent of scores on intelligence tests, that learning ability 
is not a unitary trait, and that group factors involved in learning can be 
measured by a single testing. These results are supported by the findings 
of Carroll (28), Comrey (35), and Wittenborn and Larsen (168), who 
found in school marks and achievement test scores an achievement factor 
independent of general intelligence; they also found group factors for 
subject areas. 

There is need for many more studies analyzing educational achievements 
in particular areas, for example, in reading. Langsam’s study (109) was 
inconclusive because the factors were not properly rotated to simple 
structure. Thurstone (151) concluded that Davis’s study (42) showed 
only one common factor of word-knowledge in a series of nine diagnostic 
tests. These and other studies failed to include a sufficiently wide variety 
of measures, e.g., measures of visual acuity, eye-movement characteristics, 
memory span, etc. A promising approach relating reading performances 
to auditory abilities was made by Ewers (49). 

Factor analysis research is in a similar state of underdevelopment in 
many other subjectmatter areas like mathematics, science, social studies, 
foreign-language study, music and art, etc. The few good studies reported 
have been such isolated examples that one cannot assess the validity of 
the results by comparison with other studies. 


Factor Analysis of Personality 


The extensive research in the factor analysis of personality, including 
such aspects as interests and body build, has been summarized by Cattell 
(32) and Eysenck (51). Several paper and pencil inventories have been 
constructed with the aid of factor analysis (64, 105). Baehr (4) analyzed 
three of Guilford’s scales to obtain second-order factors for a broader 
frame of reference. Cattell and Saunders (33) analyzed personality data 
from behavior ratings, questionnaires, and objective tests and did not find 
as much agreement as they expected. The relationship to personality of 
the level of muscular tension and the physiological reactions to stimulation 
was investigated by Duffy (45) and Freeman and Katzoff (57). 

Examples of factorial studies of personality which have more immediate 
relevance in educational work can be cited. The personality of boys as 
rated by teachers (91) at different ages (60) and of delinquent boys (78) 
has been investigated. Koch (108) studied preschool children and found 
second-order factors of restraint-expansiveness and socialization. Roff 
(129) analyzed the Fels Parent Behavior Scales and suggested that the 
traits of the parents may predict those of the children. Baldwin’s (5) study 
of ratings of nursery-school children found a factor of “desirability, 
representing the concept of the good nursery-school child.” The traits of 
teachers have been investigated by Schmid (133) and Hellfritzsch (80), 
the latter using several criterion measures in the battery. 


Other Illustrative Applications 


Factor analysis has been used in many special areas of research where 
the experimental designs and findings may be of interest to educators. 
Jones and Jones (99) used factor analysis to test color vision theories. 
Hsü (92) studied olfaction and Karlin (102) studied auditory functions. 
In the area of industrial psychology, personnel rating data (10, 107, 163), 
job evaluation scales (3, 112), job families (37, 96), and occupational 
interests (159) have been studied factorially. Hughes (95), Sen (134), 
and Wittenborn (167) studied scores on the Rorschach Test, Hsü and 
Sherman (94) analyzed electroencephalogram readings, and Halstead (73) 
studied brain functions in subjects who had undergone lobectomy or 
lobotomy. A study by Rafferty and Deemer (123) brought out the inter- 
esting suggestion that clinical assessments made by psychiatrists attempting 
to predict success in aviation pilot-training do not represent any insight 
over and above statistical predictions from data. Fiedler (55), using the 
Q-technic, found that in effecting psychotherapy the experience of the 
therapist is more important than the method he uses. This study design 
might be applied in the investigation of teaching methods. Hofstaetter 
(82), in studying cultural patterns in the United States, included in his 
matrix several measures related to educational characteristics of states. 


Factor Analysis in Test Construction and Criterion Research 


There have been at least three major attempts to construct batteries 
of mental ability factor tests for practical use (46, 69, 153). Prediction 
studies using these or similar batteries (88, 135) have shown them to 
have considerable promise, altho many more validity studies need to be 
done. Factor analysis is also useful in evaluating the factorial compositions 
of widely used single tests, such as the Stanford-Binet (101, 119) or the 
Wechsler-Bellevue (6). 

Factor analysis has had its effect in test construction and the theory 
of tests. Lovell (115) reported some difficulty in designing pure factor 
tests according to certain hypotheses about the nature of the factors, but 
her correlational technics are subject to some criticism on the ground 
that difficulty and chance success influences were not adequately controlled. 
Factor analysis (158) and related technics (114) have been used to 
determine homogeneity of test items. Keir (103) reported lack of homo- 
geneity in items of the Progressive Matrices Test even tho they appeared 
similar. Wherry and Gaylord (165) pointed out the influence of item 
factor patterns on the Kuder-Richardson reliability formulas. 


Guilford, from his experiences with the AAF (66), recommended a 
factorial approach to test development (63) and test evaluation (65). 
He advocated the inclusion of criterion measurements in the factor 
battery for a better understanding of what parts of the criterion are pre- 
dicted by the tests. This leads to the formulation of hypotheses concerning 
tests to predict the remainder of the criterion. Gulliksen (70) gives 
examples of how criteria can be improved thru such knowledge. Michael 
(120) found the same factors in the tests given to four groups from three 
populations but “the factorial composition of the pilot criteria was 
markedly different for these populations.” Wherry and Fryer (164) used 
factor analysis in showing that “buddy-ratings,” widely used as criteria 
in industrial and military studies, are not mere “popularity contests.” 


Needs 


By way of summarizing the numerous suggestions for research which 
have been made thruout this paper, we would first point out the need for 
an up-to-date, unbiased synthesis and evaluation of all present methods of 
factor analysis, bringing together both American and British methods, for 
the purpose of setting forth standard procedures which can be widely 
used. At the same time, methodological developments should include 
studies of the sampling distributions of factor statistics, studies of the 
invariance of results with different samples or different test batteries, and 
empirical investigations of the extent to which trained computers can 
agree in their results in view of the subjective judgments they have to 
make. 


Studies applying factor analysis must be better designed, using batteries 
of measures carefully constructed and carefully chosen to meet the require- 
ments of a given set of hypotheses. Tests which are obviously factorially 
complex (e.g., containing both speed and power aspects) should not be 
allowed to define factors. Correlational statistics which will give rise to 
spurious factors should be avoided. Factor analysis methods should be 
carefully chosen in view of one’s purpose and correctly and accurately 
applied. By no means all the studies cited here have been correctly per- 
formed; many need to be reanalyzed. The use of external criterion measure- 
ments in factor batteries should be more common. 

Factor results should be more carefully interpreted in the light of 
other kinds of information, including results from strictly experimental 
studies. Much more needs to be learned about the stability of factors, the 
effects of training on them, and their role in various types of performance. 
The predictive validity of factor tests must be further investigated. Further 
work needs to be done on the use of factor analysis in test construction. 


CHAPTER V 
Recent Developments in Statistical Theory 


PALMER O. JOHNSON and WILLIAM J. MOONAN 


The papers included in this review represent a selection from the litera- 
ture of statistical theory from January 1948 to the time of writing. The 
field of statistical science is growing so rapidly that, while the list of 
references looks formidable, the authors could hope to cover only a small 
and biased sample of it. It is necessary, of course, to step outside the 
bounds of professional education, since the advance in statistical theory 
today is largely in the hands of the mathematical statistician, particularly 
those who are both statisticians and research workers. The subjectmatter 
is divided into three major parts: (a) probability theory; (b) statistical 
inference; and (c) design and analysis of statistical investigations. A 
summary of the main statistical books of the period is presented first. 


Books 


More well-written, rigorous textbooks in descriptive statistics, design 
of experiments and probability theory have been published in the last 
three years than in any comparable period. This is not to say that the 
ones currently written are the best, but the general standard is certainly 
higher. However, while books on specialized statistical subjects are now 
appearing in increasing numbers, it is unfortunate that among these no 
complete treatment of multivariate analysis technics is to be found. 

Three needed books on sampling theory have been published. Yates 
(221) treated such topics as the avoidance and correction of bias, possible 
difficulties encountered in sampling procedures, the structure and effi- 
ciencies of different sampling plans, and the estimation of error and popu- 
lation parameters. A book by Deming (40) covers much the same material, 
but gives a broader analytical treatment of the sampling problems and 
includes more material typical of that found in the traditional textbook 
on statistics. Parten (151) wrote a useful text which contains good advice 
on questionnaire construction and interview procedures. 

A beginning text by Wilks (213) covered the general topics of elemen- 
tary statistics with accuracy and clarity while Quenouille (165) included 
an excellent chapter on transformations appropriate to changing data to 
make them conform to the assumptions of the statistical methods used. 
McNemar (131) and Guilford (74) treated a variety of applications of 
technics to the problems of psychology. Johnson (98) provided a com- 
plete discussion of the basic technics of modern experimental design and 
analysis, the tests of the underlying assumptions, and the calculations 
involved. This book also treats some multivariate problems and the analysis 
of variance and co-variance of nonorthogonal data. A recent text by Dixon 
and Massey (45) covers similar materials and includes special chapters on 
power functions, macro- and micro-statistics, and a set of tables. 

A more mathematical treatment of probability, problems of estimation, 
and tests of hypotheses has been provided by Mood (132). While mainly 
concerned with industrial applications, Tippett (189) lucidly described 
problems connected with investigation and experimentation. The theory 
of ranking has great practical value in view of the difficulty of assigning 
numerical measures occurring especially in some psychological data. 
A thoro treatment, both practical and mathematical, has been written by 
Kendall (105). He discussed Spearman’s and his own rank correlation 
coefficients with regard to their relationship, distribution and application. 
The partial rank correlation coefficient defined by Kendall will be a 
valuable statistic when its sampling distribution is found. 
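
The two rank correlation coefficients named above can be illustrated by a minimal sketch in Python. It assumes rankings without ties; the function names and the rankings assigned by the two judges are hypothetical and are not taken from Kendall's book.

```python
from itertools import combinations

def spearman_rho(x, y):
    """Spearman's rank correlation for two untied rankings of equal length."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))  # sum of squared rank differences
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant pairs - discordant pairs) / total pairs, no ties."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += 1 if prod > 0 else -1  # with untied ranks the product is never zero
    return s / (n * (n - 1) / 2)

# Hypothetical rankings of five pupils by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_a, judge_b)
tau = kendall_tau(judge_a, judge_b)
```

For these rankings rho is 0.8 and tau is 0.6; the two coefficients measure agreement on different scales and are not expected to coincide.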

The teaching and instructions of Karl Pearson and Jerzy Neyman are 
evident in David’s (38) book on probability theory. She treated the 
binomial and Poisson distributions in detail and included extensive treat- 
ment of the Markoff Theorem on least squares and characteristic functions. 
Neyman (148) has also written an important elementary book in proba- 
bility and statistics. More advanced problems are discussed by Feller (56) 
while Munroe (142) has written a probability text intended for students 
who have no more than a working knowledge of the calculus. All of these 
textbooks provide sets of practical and theoretical probability exercises 
which are superior to the usual ordering and coin tossing problems. 

Cochran and Cox (30), who have had long and varied experience in 
the field of experimental design, have written an excellent book on the 
subject. The first three chapters deal with the proper planning and analysis 
of an experiment. Specific chapters treat in detail the specialized problems 
associated with randomized block, factorial, Latin square, and lattice 
designs. This book already has become a standard reference and will 
continue to be one for many years. Edwards (48) has also written a 
methodological book which is not theoretical and uses data from psy- 
chology to illustrate principles of experimental designs. The topic of 
analysis of variance and co-variance from the point of view of a mathe- 
matical statistician is given in a monograph by Mann (124). Relatively 
new concepts, such as minimization of maximum risk, loss and complete 
class are fundamental to Wald’s (202) statistical decision functions. With 
these notions, experimentation becomes a multi- rather than a single-stage 
operation. 

Anthologies of the published works of Fisher (59) and of Pearson 
(160) have recently been compiled, and contain most of their significant 
contributions to statistics. Collected works by Neyman (149) and Stouffer 
(186) deal respectively with advanced mathematical statistics and prac- 
tical problems of measurement. Wiener (212) studied the relations of 
mathematical calculators, automatic pilots and servo-mechanisms to the 
human nervous system. An interesting discussion of the methodological 
basis of experimental inference was written by Churchman (25). Walker 
(203) contributed an important paper on statistical literacy in the social 
sciences. 

This last year was the silver jubilee of Fisher’s Statistical Methods for 
Research Workers. Tho often criticized for various reasons, no statistical 
text yet published has enjoyed such a large influence over so many experi- 
menters and statisticians. Yates (220) and Mather (129) paid just tribute 
to Fisher’s book, Youden (222) to his influence, and Hotelling (91) to 
the man. Hotelling says of him, “Statistics is entering a new era of better 
methods, sounder basic ideas, more adequate mathematical criticism and 
constructive activity, faster progress, and greater usefulness in more and 
more kinds of application. For contributing a powerful impetus to this 
movement we have to thank Ronald Aylmer Fisher.” 


Probability Theory 


Modern statistics from the standpoint of its historical antecedents is 
the resultant of the fusion of two independent disciplines. The one was 
essentially descriptive—interested in the collection of data. The other 
was primarily analytic—aligned with ideas of chance and probability. 
The intimate mutual interaction between the development of probability 
theory itself and its application in statistics is a feature of modern 
statistical theory. 

In probability theory, the appropriate theoretical relations between 
probability numbers and limiting frequency ratios are furnished by 
mathematical limit theories. Loeve (118) presented an analysis of the 
present status and in so doing placed in relief the role and interconnec- 
tions of the fundamental limit theorems. Madow (121) showed that under 
very broad conditions the usual theorems concerning the limiting distri- 
butions of estimates hold for estimates based on samples from finite 
populations selected at random with replacement. 

Today there is considerable attention given in probability and statistical 
theory to stochastic processes. In the stochastic model the limit operation 
is carried out in the definition proper of the model and the time parameter 
may thereby vary continuously. These models of stochastic or random 
processes with a continuously varying time parameter are most valuable 
in the mathematical treatment of statistical problems in many branches of 
science. For instance, the importance of stochastic processes has been well 
demonstrated in relation to problems of population growth. Kendall (102) 
gave a complete solution of the equations governing the generalized birth 
and death process in which the birth and death rates are given as any 
specified functions of time. Metropolis and Ulam (130) treated a method 
called the “Monte Carlo” method, a statistical approach leading to certain 
solutions of equations occurring in various branches of science. In this 
case a mixture of both deterministic and stochastic processes is involved 
in the mathematical description. 
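
The general idea of such a statistical approach may be shown by a small sketch in Python. It is not the procedure of Metropolis and Ulam, only an assumed elementary illustration: a definite integral is estimated by averaging the integrand at uniformly random points. The function name, the integrand, the number of samples, and the seed are all hypothetical.

```python
import random

def monte_carlo_integral(f, n_samples, seed=0):
    """Estimate the integral of f over [0, 1] as the mean of f at uniform random points."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = sum(f(rng.random()) for _ in range(n_samples))
    return total / n_samples

# The exact value of the integral of x^2 over [0, 1] is 1/3.
estimate = monte_carlo_integral(lambda x: x * x, 100_000)
```

The error of such an estimate shrinks roughly as the reciprocal of the square root of the number of samples, so accuracy is bought with many trials.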

Bose (15) treated a problem of two dimensional probability of par- 
ticular value in areal surveys. Two highly critical and informative articles 
dealt with the theoretical foundations of probability: Kendall (103) 
attempted to synthesize the frequentist and the nonfrequentist theories 
of probability. Tintner (187) described briefly the existing systems of 
statistical inference and their relations to the foundations of probability, 
proposed by the logician and philosopher Rudolf Carnap. 


Statistical Inference 


The broad problem of statistical inference consists of determining the 
uncertainty of conclusions drawn from experimental or observational 
data. The theory of probability is basic in the attack on this problem, 
since it makes possible probability statements concerning the correctness 
of the conclusions drawn. 

The studies reviewed have been classified into three main categories: 
(a) tests of statistical hypotheses; (b) statistical estimation; and (c) 
statistical decision functions. 


Tests of Statistical Hypotheses 


In the first place it may be well to attempt to distinguish between two 
contemporary points of view with respect to the testing of statistical 
hypotheses. In testing the null hypothesis, there is a clearly stated null 
hypothesis chosen by the experimenter as appropriate to his inquiry and 
a comparatively undefined set of alternative hypotheses. If sufficient 
experimental evidence becomes available, the null hypothesis is rejected. 
The experimenter does not attempt to weight the evidences from the data 
in relation to the several alternative hypotheses. Where a single hypothesis 
is to be chosen, the procedure of choosing that hypothesis which maxi- 
mizes the likelihood is the one advocated by Fisher. He objects to any 
formal theory that definitely and directly employs alternative hypotheses 
and that does not always lead to tests of significance which are inde- 
pendent of any assumed alternatives. Neyman and Pearson in their 
approach to testing hypotheses bring in alternative hypotheses as an 
intermediate step. They do not have in mind ultimately making any 
assumption about the alternative hypotheses. When a uniformly most 
powerful test exists, provision is automatically made for the assumption 
about alternative hypotheses. However, since uniformly most powerful 
tests do not often exist, tests were introduced which depended on locally 
powerful unbiased regions. These regions attain maximum power only 
when they are in the neighborhood of the null hypothesis. This limits the 
usefulness of the test. The extension of the use of the power function 
thruout the whole range of admissible values of a parameter introduces 
the unknown parameters or their a priori distribution into the test criteria. 
The general theory of testing statistical hypotheses has been particularly 
useful in furnishing methods which make it possible to compare various 
test criteria for a specified set of alternative hypotheses. 

December 1951 

Chernoff (24) discussed the problem of treating nuisance parameters 
in the finding of similar regions, i.e., regions whose size is independent 
of the nuisance parameters. While the likelihood ratio test has a number 
of desirable properties, at least in the parametric case, Lehmann (113) 
pointed out cases where this criterion is unreliable and hence the need 
for developing a systematic theory of testing hypotheses. A considerable 
number of studies have dealt with the power of a statistical test, that is, 
the measure of the sensitivity of a test with reference to the control of the 
second type of error. Wolfowitz (216, 217) gave short and simple proofs 
of the optimum properties of a number of classical tests: general linear 
hypotheses, multiple correlation, Hotelling’s T², analysis of variance. 
Pearson and Merrington (159) provided tables and graphs for deter- 
mining the power of a test of randomness in a 2 x 2 contingency table. 
They treated the case where only a single process of random sampling or 
partition is called for. Patnaik (153) also treated the power function in 
a 2 x 2 table, but was concerned with the case where two separate random 
selections are involved and the cell contents have two degrees of freedom. 
A useful result of the power function is the determination of how large 
a sample should be taken in order to obtain a conclusive result in an 
experiment. Poti (162) gave a derivation of the power of the Chi-square 
test for comparison of two multinormal populations. In order to calculate 
the power of a test it is necessary to know the probability that the statistic 
exceeds a given value for hypotheses alternative to the hypothesis under 
test. This probability is given by the noncentral integral of the statistic 
studied. Patnaik (152) derived certain approximations to the probability 
integrals and gave numerical examples of their use in experimentation. 
Lehmann and Stein (114) discussed the most powerful tests of composite 
hypotheses. Lord (119) studied the power of a modified t-test based on 
the range, and Walsh (208) computed the power efficiency of Scheffé’s 
test for the Behrens-Fisher problem. In another study (207) he made use 
of the power function to determine how much “information” is lost in 
using some other test in place of the most powerful test. Ghurye (66) 
found an approximate expression for the power of Student’s t-test when 
applied to samples from an asymmetric population. 
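
The use of the power function to fix the size of a sample can be illustrated in the simplest case, which is not drawn from any study cited above: a one-sided test of a mean when the standard deviation is assumed known. The following Python sketch rests on that assumption; the function names and the numerical values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of the one-sided z-test of H0: mu = mu0 against mu = mu0 + delta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - delta * sqrt(n) / sigma)

def sample_size_one_sided_z(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n for which the test attains at least the requested power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Hypothetical case: detect a shift of half a standard deviation.
n_needed = sample_size_one_sided_z(delta=0.5, sigma=1.0)
achieved = power_one_sided_z(delta=0.5, sigma=1.0, n=n_needed)
```

With these values 25 observations are required; one observation fewer leaves the power below the requested 0.80.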

A number of parametric tests have received other consideration. Para- 
metric tests are those which specify the value of a parameter for a given 
function or specify one or more functional relationships among two or 
more population parameters. They are usually spoken of as classical tests 
and deal with samples from populations having normal, binomial, multi- 
nomial, and other specified forms of distribution functions. Chand (23) 
studied the relative merits of different statistics available for comparing 
two means or two regression coefficients in relation to one-sided (asym- 
metric) and two-sided (symmetric) alternatives when the population 
variances are unequal. His treatment included the Behrens-Fisher statistic 
in relation to Type I errors and populations with a fixed value of the 
unknown ratio of variances. Walsh (205) dealt with a generalization of 
the Behrens-Fisher problem. Walsh (206) also presented an easily applied 
method for determining the best choice from the viewpoint of practical 
considerations (cost, difficulty, etc.) of sample sizes for a t-test when the 
ratio of the variances is known. Gayen (62) derived the theoretical distri- 
bution of Student’s t in the non-normal case for samples of any size with 
reference to the parent population specified by a number of terms of 
the Edgeworth series. Tail area probabilities of certain corrective terms 
due to the third and fourth cumulants have been tabulated. He also (64) 
derived a test for the significance of the difference between the means 
of two samples from non-normal universes, also based on the Edgeworth 
series. Hyrenius (93) studied the distribution of the Student-Fisher t in 
samples from compound normal functions showing that by varying the 
parameters of the compound normal functions, these can be made to 
describe a variety of different distribution types. David (39) proposed two 
combinatorial tests of whether a sample has come from a given popula- 
tion. One of these tests was designed to be used for small samples instead 
of the Chi-square test which is dependent upon the use of large samples. 
The second test makes it possible to specify broadly which departure from 
the hypothesis under test it is most important not to overlook. 

In addition to the assumption of normality, the assumption of stability 
of variance underlies the unrestricted use of many statistical tests. The 
study of variability itself is of fundamental importance. The study of the 
relation among variances will likely increase in importance with the 
increased interest in the components of variance technics. Cochran (28) 
derived an approximate F-test for a linear relation among variances. 
Gayen (63) derived certain mathematical forms of the frequency distri- 
bution of the variance ratio for samples drawn from certain non-normal 
populations. Practical applications were considered and diagrams were 
constructed to show the comparative effects of parental skewness and 
kurtosis on the normal theory. A basic assumption underlying the appli- 
cation of the least squares method is that the error terms in the regression 
model are independent. The problem of testing the errors for independence 
formed the subject of a paper by Durbin and Watson (46). Gulliksen and 
Wilks (75) stated and illustrated the uses of the likelihood ratio for 
testing hypotheses about the constancy of the standard error of estimate, 
slopes of regression lines, and equality of intercepts of regression lines. 
Green (70) used the likelihood criterion for testing the significance of 
differences of the standard error of measurement for different groups. 
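
The test of independence of regression errors mentioned above is commonly carried out with the statistic now associated with the names of Durbin and Watson, the ratio of the sum of squared successive differences of the residuals to the sum of squared residuals. The Python sketch below computes only this ratio and does not reproduce the authors' significance bounds; the function name and the two residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Ratio of squared successive differences to squared residuals.

    Values near 2 suggest no first-order serial correlation; values toward 0
    suggest positive and values toward 4 negative serial correlation.
    """
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual series from two fitted regressions.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # signs flip at every step
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]      # neighbors resemble each other
d_alt = durbin_watson(alternating)
d_trend = durbin_watson(trending)
```

The alternating series gives a value well above 2 and the trending series a value near 0, as the interpretation in the docstring would lead one to expect.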

The problem of testing outlying observations is of considerable impor- 
tance to the analyst and research worker. Many and various tests have been 
proposed by statisticians. Grubbs (72) summarized a large number of 
these tests and proposed additional tests with their mathematical deriva- 
tions and applications. Dixon (43) contributed to this problem. He (44) 
also gave analytic results for the distribution of ratios involving extreme 
values, a function designed for testing the consistency of suspected values 
with the sample as a whole. In speaking of individual observations, it 
may be pointed out here that there is much need in clinical psychology, 
in guidance and personnel work, and in medical diagnosis for making 
objective decisions about individuals from clinical tests, test scores, and 
many other types of observational data. 

The importance of the χ² (Chi-square) test continues to increase. Lewis 
and Burke (117) presented nine types of errors found in applications of 
the Chi-square test with illustrations taken largely from psychological 
literature. Williams (215) described the technic suggested by Mann and 
Wald for selecting the number and width of class intervals for the Chi- 
square test of goodness of fit when the null hypothesis distribution is 
continuous and completely specified. The usual χ² test in contingency 
tables covers all forms of departure from proportionality. It is corre- 
spondingly insensitive to departures of a specified type. A test based on 
regression concepts was provided for such a circumstance by Yates (219). 
The familiar χ² test for comparing the percentages of successes in a 
number of independent samples was extended by Cochran (26) to the 
situation in which each member of any sample is matched in some way 
with a member of every other sample. Bowker (16) treated the mathe- 
matical basis leading to a test of symmetry in contingency tables. Edwards 
(50) outlined methods of correcting for continuity in tests of the signifi- 
cance of the difference between correlated proportions. There is some 
disagreement as to when Yates’ corrections should be applied in con- 
tingency tables (2 x 2) and whether the Fisher-Yates method of calcu- 
lating the exact value of χ² is justifiable. Finney (57) provided a table 
of exact significance levels (when exact test is applicable) when all the 
frequencies are small. A simple rule extends the applicability of the table 
to contingency tables with larger frequencies. 
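
The practical weight of the disagreement over Yates' correction can be seen in a small sketch in Python, which computes Chi-square for a fourfold table with and without the correction for continuity. The function name and the cell frequencies are hypothetical and serve only as illustration.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square (1 degree of freedom) for the fourfold table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' correction: reduce |ad - bc| by n/2, never below zero.
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical table: 8 successes and 2 failures in one group, 3 and 7 in the other.
plain = chi_square_2x2(8, 2, 3, 7)
corrected = chi_square_2x2(8, 2, 3, 7, yates=True)
```

Here the uncorrected value is about 5.05 and the corrected value about 3.23, which fall on opposite sides of 3.84, the 5 percent point of Chi-square with one degree of freedom; with small frequencies the choice can therefore decide the outcome of the test.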

The common methods of combining probabilities from different experi- 
ments by making use of the additive properties of χ² have been found to 
introduce biases which may diminish the power of the tests in the cases 
of the binomial with a low index and of the fourfold table with small 
numbers. Lancaster (111) has given approximations that give more 
efficient tests in the case of data from discontinuous distributions. 
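
The additive procedure referred to is presumably the familiar one due to Fisher, in which minus twice the sum of the natural logarithms of k independent probabilities is referred to Chi-square with 2k degrees of freedom. A minimal Python sketch follows. It assumes probabilities from continuous distributions, which is precisely the condition that fails in the discontinuous cases discussed above, and it is not Lancaster's approximation. The function name and the three probabilities are hypothetical.

```python
from math import exp, factorial, log

def fisher_combined(p_values):
    """Return (statistic, combined probability) for independent p-values."""
    k = len(p_values)
    stat = -2 * sum(log(p) for p in p_values)
    half = stat / 2
    # For an even number of degrees of freedom (2k) the upper tail of
    # Chi-square has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    tail = exp(-half) * sum(half ** i / factorial(i) for i in range(k))
    return stat, tail

# Hypothetical probabilities from three independent experiments.
stat, combined_p = fisher_combined([0.10, 0.08, 0.12])
```

None of the three experiments reaches the 5 percent level alone, yet the combined probability is about 0.03.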

Recent developments in nonparametric statistical inference are of 
importance. This type of inference is concerned with situations in which 
distribution functions are not specified. Some knowledge of the distribu- 
tion function may be available (e.g., that it is continuous, unimodal, 
bimodal, etc.), but there are many problems in which one cannot assume 
the functional form of the population distribution. Distribution-free tests 
or nonparametric tests are the basis for testing nonparametric hypotheses. 
The nonparametric tests now available have been based largely on intui- 
tion and experience resulting from the application of well-established 
parametric test theory. Order statistics have produced a number of such 
tests. Order in this case refers to the ordered set of values in a random 
sample from least to highest. Wilks (214) has presented a comprehensive 
exposition of some of the more important findings as regards order 
statistics. Nonparametric tests have important uses in educational research. 
Tests of hypotheses about the type of distribution (e.g., a Chi-square test 
of fit to a normal distribution) are nonparametric tests. Massey (128) 
has treated the Kolmogorov-Smirnov test, a distribution-free test, as an 
alternative to χ² as a test of goodness of fit and some evidence was pre- 
sented that when it is applicable, it may be a better all around test than 
χ². Walsh (209) has derived significance tests for the population median, 
which are valid under very general conditions, and presented a number 
of applications of these tests (204). Moran (135) has reviewed recent 
developments in ranking theory and (134) derived a curvilinear ranking 
test with tables with probability values for M up to 14. Hoeffding (89) 
proposed a test for the independence of two random variables with con- 
tinuous distribution function. Mosteller (136) proposed a k-sample or 
location test for testing the hypothesis that one of the samples of k-popu- 
lations which has the largest “observed slippage” to the right, say, as 
measured by counting the number of observations in it which exceed all 
observations of all the other samples, does come from a population cen- 
tered farther to the right than the other populations. Mosteller and Tukey 
(140) developed approximate tests for k equal samples of n = 10. 
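
The Kolmogorov-Smirnov criterion mentioned above can be illustrated with a minimal sketch. It assumes that the hypothesized distribution is completely specified in advance (here a standard normal) and uses hypothetical scores; the function names are illustrative only, and the resulting statistic would still have to be referred to published tables such as those discussed by Massey.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest vertical distance between the empirical and hypothesized CDFs."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical standardized scores, used only for illustration.
scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
d = ks_statistic(scores, normal_cdf)
```

Unlike a χ² test of fit, this calculation requires no grouping of the observations into class intervals.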

Several tables of test criteria of importance to research workers have 
been recently published in Biometrika. Hartley and Pearson (84) ex- 
tended the Student-Fisher t-table to a finer interval and five-decimal-place 
accuracy. The same authors (85) developed tables giving P(χ², ν) to 
five places. Aspin (2) extended Welch’s work on the problem of com- 
paring two means and provided tables for setting up confidence intervals 
for means of equal sized samples at the 5 percent level. She (3) also 
provided tables for use in comparing two means when the variances are 
unequal. Gumbel (76) published tables of the reduced range to five 
significant decimal places for an unlimited symmetrical population dis- 
tribution of the exponential type. Bancroft (4) presented examples to 
illustrate methods of calculating probability values outside the range 
of available tables for a large number of important tests. 


Theory of Statistical Estimation 


The estimation of parameters is a central purpose of all scientific 
experimentation and of enumerative sample surveys. The problem of the 
theory of statistical estimation is that of estimating values (statistics) of 
the unknown parameters of distribution functions of specified mathe- 
matical form from samples assumed to have been drawn from such popu- 
lations. The method of maximum likelihood is now generally used in a 
great variety of problems of statistical estimation. Cochran (27) set up 
an excellent model illustrating the attack and solution of a scientific 
problem in estimation. He emphasized the importance of the practitioner 
having a clear understanding of the underlying assumptions, showed how 
to translate a practical problem into a form amenable to mathematical 
analysis, and showed how to proceed to obtain maximum likelihood 
estimates. Wald (200) considered the asymptotic properties of the maxi- 
mum likelihood estimate in the case of stochastically dependent observa- 
tions. He showed that under certain restrictions on the joint probability 
distribution of the observations, the maximum likelihood equation has at 
least one root which is a consistent estimate of the parameter under con- 
sideration. Votaw, Rafferty, and Deemer (199) derived maximum likeli- 
hood estimators for certain parameters in a truncated trivariate normal 
distribution when the values of the other parameters are known. Kullback 
and Leibler (110) showed that information in a sample defined in the 
Shannon-Wiener sense cannot be increased by any statistical operations 
and is invariant only if sufficient statistics are used. Woodbury (218) 
introduced a new descriptive parameter for tests which he called the 
standard length and showed to be related to Fisher’s concept of informa- 
tion. Ehrenberg (51) considered data in a two-way or higher classifi- 
cation without replication. He derived two unbiased quadratic estimates 
and an estimate based on the range for heterogeneous error variances. 
Seth (181) dealt with recent results on the lower bound to the variance 
of unbiased estimates. Goodman (67) devised a means of estimating the 
total number k of classes which subdivide a population on the basis of 
the sample results and knowledge of the population size. An application, 
for example, could be made in the problem of estimating the number 
of different words in a book. 

Important problems in psychological and educational analysis deal 
with the study of responses in multi-choice situations. Freund (61) de- 
veloped a statistic for the estimation of the frequencies with which 
responses are distributed within a class of alternative responses. An 
approximate formula for the variance of the estimate was obtained for 
large samples. Smith (183) presented a technic for computing the 
precision of two measuring instruments when there is a linear relation 
between the scales of the two measurements. Grubbs (71) reported a 
statistical study of methods for estimating and comparing product varia- 
tions and errors of measurement. Desmonde (42) illustrated the plotting 
of data by means of polar coordinates. 

Many applications have been found especially in biology for the 
logarithmic series developed by Fisher in 1943 in an investigation of the 
frequency distribution of numbers of species of animals obtained in 
random samples. Fisher derived the distribution by first considering a 
Poisson distribution with mean M, since when homogeneous material is 
involved, this is usually the observed distribution. However, when the 
material is heterogeneous it is no longer possible to assume that M is 
fixed for all samples. Fisher considered the superposition of a set of 
Poisson distributions as resulting in one over-all distribution: the nega- 
tive binomial distribution. Quenouille (169) worked out the relation 
between the logarithmic, Poisson, and negative binomial series. Sichel 
(182) pointed out that as an analytical tool the negative binomial distri- 
bution may have wide application in the psychological field. Practical 
applications of this series include absence proneness, accident proneness, 
and errors incurred on a two-hand coordination test, used primarily as 
a measure of speed of performance. 
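
The practical difference between the Poisson and the negative binomial series can be seen in a small numerical sketch. The parameterization of the negative binomial used here (shape r and probability p) and the particular values are assumptions chosen for illustration; they are not taken from the studies cited.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k events, computed on the log scale."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def negative_binomial_pmf(k, r, p):
    """Negative binomial probability of k events (shape r, probability p)."""
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))

def mean_and_variance(pmf, upto=200):
    """Mean and variance of a count distribution, truncated at `upto` terms."""
    probs = [pmf(k) for k in range(upto)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var

# Both distributions are given the same mean of 2 (illustrative values).
poisson_mean, poisson_var = mean_and_variance(lambda k: poisson_pmf(k, 2.0))
nb_mean, nb_var = mean_and_variance(lambda k: negative_binomial_pmf(k, 2.0, 0.5))
```

With equal means the negative binomial has the larger variance, which is the feature that makes it a candidate for heterogeneous material such as accident or absence records.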

Bartlett (6) presented and illustrated a method of fitting a straight line 
when both variables are subject to error. Housner and Brennan (92) 
dealt with the problem of bivariate regression when both variates are 
random variables that have a finite number of means distributed along 
a straight line. Workers in educational research could profit from a study 
of the various types of regression analysis problems reported by Bliss and 
Young (14), Koop (109), Rao (171), and Birren and Botwinick (13). 

Lev (116) gave methods and showed how tables of the noncentral 
t-distribution may be used to calculate tests of significance and confidence 
limits for the population correlation as estimated by the point biserial r. 
Mosteller (137, 138, 139) studied various aspects of the method of 
paired comparisons. 

Considerable attention is being given problems of nonparametric esti- 
mation, which are of considerable importance in educational research. 
Tukey (193) extended his studies of nonparametric estimation to include 
the discontinuous case. The question arises as to whether the idea of a 
confidence interval can be extended to the nonparametric case. Such an 
extension was made by Wald and Wolfowitz, and Noether (150) gave 
a survey of certain methods available for finding confidence limits in the 
nonparametric case. Patnaik (154) studied the mean range as an esti- 
mator of the variance in statistical tests. He obtained an approximation 
to the distribution from which he derived the ratio of a normal variate to 
the mean range, and showed that in the case of a single difference between 
two means and in the case of a one-way analysis of variance, a U-test 
(normal deviate) resulted in the same conclusions as the t- and F-tests. 
However, the power of the U-test is always less than that of the t-test. 
Pearson (158), on the basis of certain experimental sampling results, 
concluded that when an adjustment appropriate for a normal population 
is made, the range is an efficient estimator of the standard deviation, pro- 
vided that groups of not more than about 10 observations are taken. 
Hence, with small samples when large numbers of calculations must be 
made, the use of the range test would be advantageous. Benson (7) indi- 
cated that it is sometimes useful to use quantiles for estimating variabilities 
and central tendency. Kendall (104) found that τ (tau), his measure of 
rank correlation, is relatively insensitive to departure from normality, 
particularly in comparison with Spearman’s ρ (rho). He found that τ is 
not very accurate as an estimator of product-moment correlation. Daniels 
(36) found that even tho Kendall’s and Spearman’s measures of rank 
correlation are highly correlated in the case of independence, they do 
not describe the same aspect of a separate bivariate population of ranks 
when the correlation is not zero. Whitfield (210) defined and illustrated 
a coefficient of rank intraclass correlation and gave probability tables 
for testing its significance. Whitfield (211) also discussed the uses of the 
ranking method in psychology. 
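
That Kendall’s and Spearman’s coefficients need not agree when the correlation is not zero can be shown with a minimal sketch. It assumes untied observations and uses the usual textbook formulas; the data are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau for untied data: (concordant - discordant) / total pairs."""
    s = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    n = len(x)
    return s / (n * (n - 1) / 2)

def spearman_rho(x, y):
    """Spearman's rho for untied data from squared rank differences."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda k: v[k])
        r = [0] * len(v)
        for rank, k in enumerate(order, start=1):
            r[k] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical paired rankings of five individuals.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau(x, y)   # 0.6
rho = spearman_rho(x, y)  # 0.8
```

The two coefficients summarize the same pair of rankings differently, so neither should be read as a substitute for the other.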


Theory of Statistical Decision 


The two types of problems of statistical inference just considered do 
not cover all the possibilities. A third problem met with in practice is to 
decide on the basis of the observations which of a number of hypotheses 
should be accepted. A new theory developed largely by the late Professor 
Abraham Wald was designed for this type of problem. It is known as the 
theory of decision functions. This new theory centers about the problem 
of statistical action or of selecting one of a plurality of possible actions 
on the basis of the observations. Wald suggested that a certain weight or 
risk function be associated with each hypothesis. This function expresses 
the loss expected as a consequence of the method of selection. Thus, if a 
certain hypothesis were true, then there would be certain losses attendant 
to acceptance of each of those several alternatives. Obviously if the true 
alternative were accepted, the loss would be zero. The minimax principle, 
i.e., minimizing the maximum loss, is basic in the theory. The initial step 
in the application of the minimax rule is the calculation of the loss func- 
tion. The mathematical problems underlying the minimax rule are in 
principle identical to those encountered in the theory of (zero-sum) two- 
person social games formulated by Von Neumann and Morgenstern in 
their book, Theory of Games and Economic Behavior. 
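
The minimax rule can be sketched for the simplest possible case, in which a loss is assigned directly to each combination of action and true hypothesis. This is a deliberate simplification: Wald’s theory deals with decision functions of the observations and with expected loss, and may require randomized decisions. The loss values and action names below are hypothetical.

```python
# Rows are hypothetical actions, columns are the possible true hypotheses.
# loss[a][h] is the loss from taking action a when hypothesis h is true;
# accepting the true hypothesis carries zero loss.
loss = {
    "accept_H1": [0.0, 4.0, 6.0],
    "accept_H2": [3.0, 0.0, 2.0],
    "accept_H3": [5.0, 3.0, 0.0],
}

def minimax_action(loss_table):
    """Return the action whose worst-case loss is smallest, with that loss."""
    worst = {action: max(row) for action, row in loss_table.items()}
    best = min(worst, key=worst.get)
    return best, worst[best]

action, value = minimax_action(loss)
```

Here the maximum losses are 6, 3, and 5, so the rule selects the second action, whose largest possible loss is 3.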

Sequential analysis and experimentation may be thought of as a special 
case of the statistical decision function. Here there is a dichotomy of 
accepting or rejecting a hypothesis. The decision of reaching a conclusion 
or continuing the experimentation depends at any stage on the informa- 
tion from the observations available up to that point. 
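
A minimal sketch of a sequential test of this kind is Wald’s probability ratio test for a proportion, given below. The stopping limits use Wald’s customary approximations in terms of the two risks of error; the function name, the hypothesized proportions, and the response sequences are hypothetical.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of p = p0 against p = p1.

    Returns (decision, number of observations used).
    """
    upper = math.log((1 - beta) / alpha)   # reaching this limit rejects H0
    lower = math.log(beta / (1 - alpha))   # reaching this limit accepts H0
    llr = 0.0                              # running log likelihood ratio
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)

# Hypothetical item responses: 1 = correct, 0 = incorrect.
all_correct = sprt_bernoulli([1] * 10, p0=0.5, p1=0.8)
all_wrong = sprt_bernoulli([0] * 10, p0=0.5, p1=0.8)
```

At each stage one of three outcomes is possible: accept, reject, or take another observation, which is the feature that distinguishes the sequential from the fixed-sample procedure.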

Savage (180) contributed an excellent review and interpretation of 
Wald’s recent book, Statistical Decision Functions. Wald (201) presented 
the main results of an earlier paper on the foundation of a general theory 
of statistical decision functions and reported new results obtained under 
less restrictive conditions. Paulson (157) applied a multiple decision 
procedure to certain problems in the analysis of variance. Hodges and 
Lehmann (88) considered the problem of point estimation in terms of 
risk functions, without the customary restriction to unbiased estimates. 
Wolfowitz (217) proved that the classical estimation procedures for the 
mean of a normal distribution with known variance are minimax solutions. 

Rao (173) derived sequential tests of null hypotheses in experimenta- 
tion. Moonan (133) explained the theoretical foundation of sequential 
analysis and applied the technic to the scoring of achievement examina- 
tions. Kimball (107) discussed and illustrated sequential sampling plans 
particularly with reference to checking large-scale psychological testing 
programs for accuracy of scoring. 





REVIEW OF EDUCATIONAL RESEARCH Vol. XXI, No. 5 





Design and Analysis of Statistical Investigations 


Statistics provides the doctrine of the planning of experiments and 
collection of other types of observations, and of interpreting their 
consequences. 

This section deals with the theory of the design of experiments and 
sampling surveys and the theory of multivariate analysis. In this con- 
nection, the function of modern statistical science in educational and 
psychological research has been discussed by Johnson (96). 


Theoretical Developments Underlying Experimental Design 


Because the designing of an experiment is somewhere between an art 
and a science, there can be no specific rules telling the experimenter which 
design to use in a research problem. This decision is usually dependent 
upon the accuracy and efficiency desired, as well as upon features inherent 
in the experimental situation. However, it is true that problems associated 
with correct sample size are being clarified. Steps in this direction have 
recently been undertaken by Harris, Horvitz, and Mood (79). It is 
probably true that the simpler designs will be used most frequently. 
Cox (34) gives a table of the number of times certain designs were used 
at an agricultural station and also a table of the relative efficiencies of 
lattices compared to randomized block designs. 

A factorial experiment is one in which all combinations of the factors 
involved are included in each experimental unit of the design. Confound- 
ing makes possible a reduction in the number of such combinations per 
experimental unit but still includes all combinations in each replication. 
If many replications are desired to insure accuracy, it is possible to keep 
costs down by studying only a part of the possible combinations. This 
device is known as fractional replication. Nair and Rao (146) explained 
the sufficient combinational conditions which lead to the construction of 
confounded factorial experiments. An explanation of the theory of frac- 
tional replication in factorial experiments and its extension has been 
given by Rao (176) and Banerjee (5). 
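
The idea of fractional replication can be illustrated with three factors at two levels each. The sketch below assumes the half replicate defined by the relation I = ABC, a common textbook choice and not necessarily one treated in the papers cited; levels are coded -1 and +1.

```python
from itertools import product

factors = ["A", "B", "C"]

# Full 2^3 factorial: every combination of low (-1) and high (+1) levels.
full = [dict(zip(factors, levels)) for levels in product((-1, 1), repeat=3)]

# Half replicate defined by I = ABC: keep runs whose level product is +1.
half = [run for run in full if run["A"] * run["B"] * run["C"] == 1]

# Within the half replicate the main effect A cannot be separated from
# the BC interaction, because the two contrasts coincide on every run.
aliased = all(run["A"] == run["B"] * run["C"] for run in half)
```

The saving of four runs is paid for by this aliasing, so the fraction is appropriate only when the aliased interactions can be assumed negligible.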

The utility of the Latin square principle in psychology has been recog- 
nized by Grant (69) and Bugelski (21). Edwards (49) pointed out that 
by replication of the same Latin square, rather than orthogonal ones, 
row variation can be split into two components thus enabling the variation 
of the subjects (columns) presented with the same order to be identified. 
As pointed out by Youden (223), it is possible for a certain set of 4 x 4 
Latin squares to separate the residual interaction sum of squares of rows 
and columns into components which are associated with comparisons of 
the treatments. The use of restricted randomization to preserve inde- 
pendence qualities of plots and achieve a uniform degree of accuracy 
for nearly all plans of Quasi-Latin squares is discussed by Grundy and 
Healy (73). 
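
The defining property of a Latin square, that each treatment occurs once in every row and once in every column, can be checked mechanically. The sketch below builds a square by cyclic shifting, which is only one of many possible constructions and is an assumption made for illustration; in practice rows, columns, and treatments would be randomized before use.

```python
def cyclic_latin_square(treatments):
    """Build an n x n Latin square by shifting the treatment list one place per row."""
    n = len(treatments)
    return [[treatments[(r + c) % n] for c in range(n)] for r in range(n)]

def is_latin_square(square):
    """True if every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

# Four hypothetical treatments, e.g., four orders of presentation.
square = cyclic_latin_square(["A", "B", "C", "D"])
valid = is_latin_square(square)
```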

The general class of designs known as incomplete block designs, which 
are useful when there are many treatments to be examined, can be sub- 
divided into categories that include balanced incomplete blocks, lattice 
squares and lattices. The treatment of the theory connected with these 
designs has been extended by Nair (143), Harshbarger (80), Kempthorne 
(100, 101) and Federer (54, 55). 

Quantitative ancillary information regarding experimental units may be 
used subjectively, as in matched pairing, or statistically, by the 
analysis of co-variance. Engelhart (52) compared some aspects 
of these two technics. An expository paper emphasizing the uses of the 
analysis of co-variance was written by DeLury (41). Quenouille (163) 
showed that when slight nonorthogonality exists in an experiment, it is 
possible to carry out the analysis by co-variance methods. Stevens (185) 
presented an analysis of a nonorthogonal tri-factorial experiment. 

Considerable work has been done on transformations appropriate for 
correcting for non-normality and heterogeneity of variance in design 
problems. A test for heterogeneity of variance which considers only the 
ratio of the largest variance to the smallest and which has power only 
slightly less than Bartlett’s M-test was given by Hartley (82). The problem 
of nonadditivity of effects is less commonly considered than the hetero- 
geneity problem. Tukey (194) showed how to adjust for the effects of 
nonadditivity by utilizing one degree of freedom associated with the 
error variance. 
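
The variance-ratio criterion described above requires very little computation, as the following minimal sketch suggests. It assumes groups of equal size and computes only the ratio itself; the significance of the ratio would have to be judged from the published tables. The scores are hypothetical.

```python
from statistics import variance

def f_max(groups):
    """Ratio of the largest to the smallest sample variance among the groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

# Three hypothetical treatment groups of five scores each.
groups = [
    [10, 12, 14, 16, 18],
    [11, 12, 13, 14, 15],
    [5, 10, 15, 20, 25],
]
ratio = f_max(groups)
```

For these groups the sample variances are 10, 2.5, and 62.5, giving a ratio of 25.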

The use of a preliminary F-test as a basis for combining mean squares 
which are thus shown not to differ significantly can lead to disturbances 
in the final F-test. A more appropriate procedure is recommended by 
Paull (156) who suggested that mean squares not be pooled unless their 
ratio on the preliminary test is less than the 50 percent value. The logical 
basis for defining main effects and interactions was discussed by Finney 
(58), and Vajda (196) considered their mathematical relationship. 

After mean square tests have been run, it is desirable to make compari- 
sons between means of treatment contrasts. Much confusion is apparent 
regarding how to proceed with this task. Tukey (190) presents a simple 
and definite method based on the idea of dividing treatments into dis- 
tinguishable sets. Studentized tables which are useful in testing a single 
treatment (best or worst) out of a group have been given by Nair (144). 
In a large factorial experiment, some mean square is expected to be 
significant just because it is the maximum of a set of mean squares. Nair 
(145) also prepared a table of significant points for this situation. 

An article by Crump (35) discussed the present status of variance com- 
ponent analysis. Information regarding explicit component variances is 
extremely useful and valuable in educational and psychological research. 
Such information can be used, for example, to estimate desired classroom 
size in order to maximize individual differences. Cameron (22) and Hen- 
dricks (87) used variance components to determine economical sampling 
plans. 

The theoretical problems of component analysis deal with the estima- 
tion, distribution and tests of hypotheses of the components in analysis of 
variance models. In a two-way classification with equal subgroups, one 
model is $X_{ijk} = \mu + \rho_i + \gamma_j + (\rho\gamma)_{ij} + \epsilon_{ijk}$, where $i = 1, 2, \ldots, r$; 
$j = 1, 2, \ldots, c$; and $k = 1, 2, \ldots, n$; and all terms in the expression for 
$X_{ijk}$ which are to the right of the equal sign are constants except $\epsilon_{ijk}$, 
which is a normally distributed random variable with a mean of zero and 
variance equal to that of the population of subjects with treatment effects 
eliminated. In other models either all or a group of the terms may be 
random variables. The mixed model was discussed by Johnson (95), and 
a very general model was proposed by Tukey (192). A studentized form 
of the fiducial limits of variance components is unknown, but a method of 
obtaining approximate limits has been investigated by Bross (19). An 
empirical study by Comstock and Robinson (31) showed, at least for 
their data, that the limits proposed by Bross were satisfactory. Smith 
(184) pointed out that, for data with proportional subclasses, a linear 
function of the rows of the analysis of variance table is required to esti- 
mate the individual components. Tukey’s papers (191, 192) provided an 
interesting and new theory on component analysis and, incidentally, 
offered a challenge for better experimentation in the social sciences. 
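
How components are estimated from mean squares can be sketched for a case simpler than the two-way model above: a one-way classification in which the group effects are themselves random. The sketch assumes equal numbers of observations per group and equates the observed mean squares to their expectations; the function name and data are hypothetical.

```python
def variance_components(groups):
    """Estimate between-group and within-group variance components.

    Assumes a one-way random-effects layout with the same number of
    observations in every group.
    """
    a = len(groups)       # number of groups
    n = len(groups[0])    # observations per group
    means = [sum(g) / n for g in groups]
    grand = sum(means) / a
    ms_between = n * sum((m - grand) ** 2 for m in means) / (a - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (a * (n - 1))
    # E[MS between] = sigma_w^2 + n * sigma_b^2 and E[MS within] = sigma_w^2.
    return (ms_between - ms_within) / n, ms_within

# Hypothetical scores for three classes of two pupils each.
between, within = variance_components([[1, 3], [5, 7], [9, 11]])
```

An estimate obtained in this way can turn out negative when the between-group mean square falls below the within-group mean square, which is one reason the distribution theory of such estimates has received attention.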

Recent trends point to the increased use of nonparametric statistics in 
the design of experiments. Evidence of this is Hartley’s (83) article on 
the use of the range in the analysis of variance. Some loss of efficiency 
and power results from the use of these procedures instead of conventional 
methods, but in cases where time is important and data plentiful such 
methods are welcomed. On occasion, transformation of data is necessary 
to meet the assumptions underlying a statistic or its distribution. Mueller 
(141) has provided a long list of such transformations, but omits some 
important ones, such as Fisher’s z and the probit. Bruner, Postman, and 
Mosteller (20) treated the transformation of data distributed in the 
Poisson form. 


Sampling Theory and Practice 


In his presidential address before the United Nations Sub-Commission 
on Statistical Sampling, Ronald A. Fisher spoke of modern sampling as 
“the most adaptable, rapid, economical, and in the true sense, scientific 
method of practical ascertainment which we now possess.” Developments 
during the period concerned can be mentioned only briefly in this review. 

Marks (126) presented some sampling problems in educational re- 
search, discussed some common misconceptions, and illustrated the tech- 
nics of simple random sampling, cluster sampling, and sampling with 
unequal probabilities. Perhaps in no aspect of educational research is 
the gap between modern principles and practice as great as in sampling. 

Madow (122) extended the results of earlier papers to the systematic 
sampling of clusters of equal and unequal sizes. Cluster sampling is of 
great practical importance in educational research. Hansen and Hurwitz 
(77) indicated a method of determining the probabilities of selecting 
sample units which minimize the variance of the sample estimate for a 
fixed cost. Hendricks (86) presented derivations of some random sampling 
distributions and discussed various problems in sampling. Fog (60) 
showed how multi-dimensional geometrical proof of certain distributions 
may be translated to analytical form. Quenouille (168) considered the 
relative accuracies of systematic and stratified random sampling in one 
dimension. He also showed how to estimate the variance of the mean of a 
systematic sample by using sets of q systematic samples with randomly 
chosen starting points. Tukey (195) illustrated how results in the theory 
of sampling from finite populations can be obtained very easily by using 
Fisher’s k statistics. 

Ghosh (65) introduced “space correlation” as a criterion for random- 
ness in the two-dimensional case. Das (37) gave a set of sufficient condi- 
tions under which stratified sampling is more efficient than random 
sampling. He also gave a sufficient condition under which systematic 
sampling is more efficient than stratified sampling. Marcuse (125) derived 
formulas for optimum allocation in nested sampling with k levels. Evans 
(53) gave useful condensations of the variance equations for estimates 
of the mean obtained from stratified proportional and optimum allocation 
samples. 
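
The distinction between proportional and optimum allocation can be made concrete with a short sketch. It assumes equal cost per unit in every stratum, so that the optimum sample size in a stratum is proportional to the product of the stratum size and the stratum standard deviation; the strata and their values are hypothetical, and the results would be rounded to whole units in practice.

```python
def proportional_allocation(sizes, n):
    """Sample sizes proportional to the stratum sizes N_h."""
    total = sum(sizes)
    return [n * size / total for size in sizes]

def optimum_allocation(sizes, sds, n):
    """Sample sizes proportional to N_h * S_h (equal unit costs assumed)."""
    weights = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Three hypothetical strata of schools, with a total sample of 100.
sizes = [500, 300, 200]
sds = [10.0, 20.0, 40.0]
prop = proportional_allocation(sizes, 100)
opt = optimum_allocation(sizes, sds, 100)
```

Under optimum allocation the smallest but most variable stratum receives the largest share of the sample, whereas under proportional allocation it receives the smallest.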

Goodman and Kish (68) developed procedures by which the proba- 
bilities of selection for preferred combinations are sharply increased. The 
theoretical basis for the methods is stated and the procedures are applied to 
a specific problem. Keyfitz (106) described a device for changing to a new 
set of probabilities, that is, a device for adjusting probabilities so that 
the selected unit is chosen with probability proportional to the present 
measure of size with the new unit the same as the old one in as many of 
the strata as possible. Patterson (155) discussed appropriate methods of 
estimating means on different occasions and changes in these means. This 
was done with respect to the method of sampling on successive occasions 
by the partial replacement of sampling units. Kish (108) described a 
procedure for selecting objectively one member of the household as a 
respondent. 

Birnbaum and Sirken (12) presented a technic for the treatment of 
errors introduced into sampling surveys due to the nonavailability of 
respondents. Birnbaum, Paulson, and Andrews (11) considered the situa- 
tion where samples are obtained under circumstances making it less likely 
to draw individuals from some parts of a population than from other 
parts. They introduced a method whereby under certain assumptions, it 
is possible to correct for the resulting bias and thereby to reconstruct the 
means, standard deviations, and correlation coefficients of the original 
population. Birnbaum (9) described a method for treating the practical 
problem where selection from a bivariate normal population is per- 
formed by retaining only the individuals for whom the value of one 
variate is at least equal to a specified cutting score, while the other 
dichotomized variate is observed only in the selected population. It is 
desired to determine how a change of the cutting score will affect the 
ratio, i.e., fraction “successful.” Hansen and others (78) formulated 
explicitly a mathematical model for response errors. In this model non- 
responses in censuses as well as in sample surveys were considered as 
response errors. Marks and Mauldin (127) compared four different 
procedures for measuring and evaluating response errors. 


Multivariate Analysis 


Under relaxed conditions, R. Von Mises first generalized the problem 
of classifying an individual on the basis of p variables into one of k > 2 
populations on the basis that the procedure should maximize the minimum 
probability that an individual should be assigned to the population of 
which he is actually a member. This is a direct formulation of the minimax 
principle which has been recently exploited by Wald (202) who developed 
a statistic for this classification problem. However, even in simple cases 
its sampling distribution is difficult to obtain—even asymptotically. This 
situation is discussed by Harter (81). Anderson (1), using the minimax 
principle, has worked out a numerical example using three populations. 
Assuming the existence of a priori probabilities, which are later estimated 
from the sample, Hoel and Peterson (90) gave two theorems which pro- 
vide an “optimum” criterion for large sample discrimination between two 
or more populations. Birnbaum (8) and Birnbaum and Chapman (10) 
have shown that the classification of an individual is often preceded by a 
series of inference problems which must be met before the original 
classification problem can be solved. This amounts to a multi-decision 
arrangement. 

The utilization of minimax solutions has been studied by Rao (177). 
In this paper he also considers the problem of classifying the populations 
themselves into some significant system. In another paper, Rao (174) 
pointed out the logical questions associated with discrimination and tests 
of hypotheses. In the former case there exists the problem of controlling 
right and wrong decisions, while tests of hypotheses are concerned with 
rejecting a null hypothesis with a given coefficient of risk. Roy (179) 
presented a number of practical applications of some concepts developed 
in connection with composite hypotheses which may be extended to 
multivariate normal populations. 

The logical and mathematical relationships between certain multivariate 
technics, such as principal components, canonical correlation, and dis- 
criminant functions, have been discussed by Tintner (188). The discrimi- 
nant function by itself provides some unique problems which were dis- 
cussed by Rao (172). He concluded that whatever the number of char- 
acters chosen to discriminate between two populations, it is most profitable 
to have the samples from each population of equal size. Moreover, in 
order to attest the significance of the metric (Mahalanobis’ D²) between 
them, it is advisable to have individuals in the samples as nearly alike 
as possible on all chosen characteristics. This position differs from that 
of Cochran and Cox (30) who chose to select the sample units randomly. 
More is being learned about the effects of increasing or decreasing the 
number of variables in a discriminant problem. Contributors to this 
subject were Quenouille (164, 167) and Rao (170, 172, 175). 

The application of multivariate analysis to a variety of problems con- 
tinues to be fruitful and frequent. A review, extension, and discussion of 
the methods of multivariate experimentation was given by Quenouille 
(166). Johnson (98) used the discriminant function to determine a set of 
optimum weights in a grading problem, and Jackson (94) found that it 
enabled him to determine the relative values of the tests used in a selection 
process. Co-variance analysis may be used with discriminant functions 
as shown by Cochran and Bliss (29). 

A particularly useful article showing all phases of an investigation 
involving the classification and metrization of neurotic groups of soldiers 
was written by Rao and Slater (178). A similar study might well be 
undertaken using different mental illness classes and the subtests of the 
Minnesota Multiphasic Inventory. The usual, inefficient and probably 
inaccurate, method of profile analysis associated with this battery and 
others like it, could be abandoned, at least for research purposes. A gen- 
eral anthropometric survey using the generalized distance function was 
conducted by Mahalanobis, Majumdar, and Rao (123). The calculation 
required in such problems ordinarily involves the inversion of matrices 
which is laborious when the orders of matrices approximate 10. Rao (170) 
tells how this inversion can be eliminated by using the transformed vari- 
ables. With these, D² can be calculated as a simple sum of squares. 
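
One way in which a transformation of the variables avoids matrix inversion can be sketched as follows. The sketch uses a triangular (Cholesky) factorization of the co-variance matrix, which is an assumption made for illustration and is not necessarily the transformation Rao employed; the means and co-variances are hypothetical.

```python
import math

def cholesky(s):
    """Lower-triangular L such that L times its transpose equals s."""
    p = len(s)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = s[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L

def d_squared(mean1, mean2, cov):
    """Generalized distance as a sum of squares of transformed differences."""
    L = cholesky(cov)
    diff = [a - b for a, b in zip(mean1, mean2)]
    z = []
    for i in range(len(diff)):
        # Forward substitution solves L z = diff without inverting cov.
        z.append((diff[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return sum(v * v for v in z)

# Hypothetical group means and pooled co-variance matrix for two tests.
d2 = d_squared([2.0, 1.0], [0.0, 0.0], [[4.0, 2.0], [2.0, 3.0]])
```

For these values the direct calculation with the inverse matrix also gives 1, so the transformed variables reproduce the distance while requiring only square roots and divisions.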

Other varied uses of multivariate analysis included Peel’s (161) method 
of estimating battery reliability by means of a canonical correlation. 
Johnson and Fay (99) gave the theoretical basis and simplified the cal- 
culation method for determining a region of significance by the Johnson- 
Neyman technic. Another application was that made by Box (18). 

Normality, linearity of regression, and equality of variance-co-variance 
matrices are the usual assumptions in multivariate problems. Tests for 
these assumptions are known. Box (17) gave a general distribution 
theory for the last of these assumptions. Special tests are given by Votaw 
(197) and by Votaw, Kimball, and Rafferty (198). For a discussion of 
significance test problems arising when considering the deletion or addi- 
tion of variables in a discriminant analysis, see Rao (175). Lubin (120) 
considered nonparametric tests of the linear discriminant function as well 
as some aspects of the case of nonlinear discrimination. The distribution 
of characteristic roots has been studied by Nanda (147), and a matrix 
derivative theory which assists and simplifies derivation problems in 
multivariate analysis has been proposed by Dwyer (47). 


CHAPTER VI 


Computational Technics 


NICHOLAS A. FATTU 


Since the last Review, publications in the area of “Computational 
Technics” seem to be concentrated more than ever on high-speed electronic 
developments. It appears to be almost a matter of status maintenance in 
foreign countries as well as this one to have such a device around or 
in the making. Military agencies still possess the major share of these 
devices, but industrial concerns and universities are increasingly moving 
in this direction. Altho educational research has not been affected by these 
developments so far, it is quite likely that the next Review will report 
several such applications. 

The majority of references available on computational developments 
could not properly be listed here. Many were so specialized in terms of 
technical electronics as to be unsuitable. Others dealt with the Monte 
Carlo method, and other devices, which have little foreseeable application 
to educational research. For each of the references included, there is a 
possible meaningful relationship to educational research. 

Computational developments are summarized under the following head- 
ings: bibliographies, computers, general discussion, digital computers, 
analog computers, brain-machine analogy, tables, graphs and charts, 
punched card technics, computational methods and formulas, and social 
impact of these developments. 


Bibliographies 


The journal, Mathematical Tables and Other Aids to Computation, 
summarized references on all of the headings listed above, and continues 
to be the best single source of information. Other useful bibliographic 
sources are: Mathematical Reviews; the 1949 revision of the Bibliography 
on the Use of IBM Machines in Science, Statistics, and Education (64);
the National Bureau of Standards Bibliography on Automatic Digital 
Computing Machinery (132); and the bibliographies given by Berkeley 
(10), Engineering Research Associates (35), Patterson (106), and The 
Electronic Engineering Master Index (116). Other bibliographies are 
those by Hartree (54, 55), Ferris and others (36), Davis (28), Adams 
(2), and an anonymous source (70). Much of the work on computing 
machinery is done by teams of researchers and is published either anony- 
mously, or by listing only the name of the research organization. 

The bibliographic sources Education Index, Reader’s Guide to Periodi- 
cal Literature, Psychological Abstracts, Biological Abstracts, and Chemical 
Abstracts are of little, if any, help in this respect. 
Computers—General


Computational machines are generally classified either as digital or as 
analog computers. A digital device works with discrete values, and the 
results yielded are expressed in digits. The precision of a digital machine 
depends upon the number of digits it can handle. An abacus, a commercial 
desk calculator, and an IBM sorter and tabulator are digital machines. 
The “word length” (number of digits) handled by these machines varies 
from four binary digits of the “Simon” (10, 12), to 23 digits of the 
Mark I, 10 decimal digits of the ENIAC, 19 decimal digits of the SSEC, 
44 binary digits of the EDVAC, 40 binary digits of the IAS machine, and 
35 binary digits of the SDC which can, however, handle 70 binary digits 
as pairs of standard numbers. 

An analog computer is one in which numbers are converted, for pur- 
poses of computation, into physically measurable quantities such as 
lengths, voltages, or angles of displacement. Computed results are ob- 
tained by the interaction of moving parts or electrical signals in such a 
manner as to solve an equation. The precision of results depends on the 
precision with which the device was fabricated, the skill and uniformity 
with which it was operated, and the precision with which the final result 
was read. A slide rule, a computing chart, and an IBM test scoring
machine are examples of analog computers. 

Many analog machines are currently in military use as fire control 
devices. These machines receive their input from and feed their output 
to other machines without intervention of an operator. 

Good general discussions of calculating machines are found in the 
following sources which are listed in their estimated order of complete- 
ness, beginning with the most general: Ridenour (114, 115), Berkeley 
(10), Richtmeyer and Metropolis (113), Hartree (54), Booth and Britten 
(17), Huskey (63), Hartree (55), and Engineering Research Associates 
(35). 

Development of these computers has necessitated building up a termi- 
nology of new words and of current words used in specialized senses. 
Different groups have tended to develop their own vocabulary. In order to 
insure uniformity of meaning, the IRE (69) has set up standard definitions 
of terms ranging from “access time” to “write.” 


Automatic Digital Computers 


Digital computers of the sort represented by the familiar adding ma- 
chine or the desk calculator are very old. The first practical adding 
machine, based on number wheels, was devised by Pascal in 1642. A 
machine capable of multiplication was designed by Leibnitz in 1671 and
built in 1694 (114). Desk calculators and IBM machine developments
are discussed elsewhere in this review because they are not automatic. 
In the automatic machine, sufficient instructions are given the machine at
the beginning, and the machine has sufficient capabilities within itself, 
to perform a whole complicated sequence of operations on many numbers 
without any outside assistance. The idea of the automatic machine itself 
was developed by Babbage before 1835. The Babbage analytical engine 
was to consist of (a) a “store” to record information in a set of registers, 
(b) a “mill” where arithmetic operations were ground out, and (c) an 
unnamed portion consisting of transfer mechanisms. Control of the ma- 
chine was to be carried out thru a sequence of punched cards. When the 
IBM Company built the Harvard Mark I, it was hailed as “Babbage’s 
dream come true” (55). 

Major assemblies of automatic digital machines may be distinguished 
as follows: 


1. The arithmetic unit which performs the individual arithmetic
operations as required by the schedule. Details of this unit were
described by Hartree (55), Berkeley (10), Engineering Research
Associates (35), the MDL Staff (94), the Electronic Engineering Staff
(34), Kjellberg and Neovius (81), Eckert (33), Wilkes and Renwick 
(144, 145), Kilburn (80), Bliss (15), West and DeTurk (138), and 
Shaw (123). 

2. The inner storage or memory unit which registers numbers enter- 
ing the problem or intermediate results and keeps them accessible for 
use as the computation requires. Orders or instructions to the machine 
that govern the sequence of the computation are also stored here. 
Storage devices are generally of three types: 


a. acoustic delay lines in which data are stored in a mercury tube 
in the form of acoustic impulses 

b. magnetic materials (wire, tape, or drum surface) where data 
are stored as variations in the magnetic state of material 

c. cathode ray tubes and insulating screens where data are stored 
as a static charge distribution. Details of these devices and others may 
be obtained in Wang and Woo (136), Hartree (54, 55), Cohen (24), 
Haeff (47), Engineering Research Associates (35), Berkeley (10), 
Kilburn (80), Bliss (15), Williams and Kilburn (146), Wilkes and 
Renwick (144), Huskey (62), and Booth and Britten (17). 


3. The control unit which keeps track of the calculation, determines 
which operation will be performed next, and causes it to be performed. 
Descriptions are given by Berkeley (10), Engineering Research
Associates (35), Hartree (55), and Shaw (123).

4. The input-output unit thru which the machine is supplied initially 
with data and instructions constituting the problem. Intermediate or 
final results can be read out of the machine. Often the input-output 
unit is used to provide a large capacity low speed storage for numbers 
and orders. This outer memory thus supplements the inner memory. 


When a mathematical problem is to be presented to the machine, it 
must be programmed or broken down into the individual steps to be
performed sequentially. The input units feed the program of numbers 
and orders into the memory unit of the machine at the beginning, and
the work thereafter proceeds entirely within the machine. 

Orders are expressed in a number code and are treated as numbers so 
far as the machine is concerned. Thus the memory locations (addresses) 
can store either a number or an order, indifferently, and the arithmetical
operations can be performed on the order pseudo-numbers, thus changing 
one order into another when this is required as the computation progresses. 
Details of coding and programming are discussed by Berkeley (10), 
Goldstine and Von Neumann (44), Hartree (54), MDL Staff (94), Wilkes 
(141), and Wilkes and Renwick (144). 
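
The treatment of orders as numbers can be illustrated with a minimal sketch. The order code used here (opcode times 1000 plus address), the six opcodes, and the memory layout are assumptions made purely for illustration; they do not correspond to any machine named in this chapter. The program sums a short list of data by repeatedly adding 1 to its own ADD order, so that the address part of that order advances as the computation progresses.

```python
# Hypothetical order code: an order is the number opcode * 1000 + address.
HALT, LOAD, ADD, SUB, STORE, JUMP_IF_NONZERO = 0, 1, 2, 3, 4, 5


def run(memory):
    """Execute the orders held in `memory` until a HALT order is reached."""
    accumulator = 0
    counter = 0  # address of the next order to be obeyed
    while True:
        opcode, address = divmod(memory[counter], 1000)
        counter += 1
        if opcode == HALT:
            return memory
        elif opcode == LOAD:
            accumulator = memory[address]
        elif opcode == ADD:
            accumulator += memory[address]
        elif opcode == SUB:
            accumulator -= memory[address]
        elif opcode == STORE:
            memory[address] = accumulator
        elif opcode == JUMP_IF_NONZERO and accumulator != 0:
            counter = address


def summing_program(data):
    """Lay out orders and data (1 to 10 numbers) in one memory."""
    memory = [0] * 40
    orders = [
        LOAD * 1000 + 30,            # 0: bring in the running total
        ADD * 1000 + 20,             # 1: add one datum (this order is modified)
        STORE * 1000 + 30,           # 2: put the total back
        LOAD * 1000 + 1,             # 3: treat order 1 as an ordinary number
        ADD * 1000 + 31,             # 4: add 1, advancing its address part
        STORE * 1000 + 1,            # 5: store the modified order
        LOAD * 1000 + 32,            # 6: bring in the count of data remaining
        SUB * 1000 + 31,             # 7: reduce the count by 1
        STORE * 1000 + 32,           # 8: store the count
        JUMP_IF_NONZERO * 1000 + 0,  # 9: repeat while data remain
        HALT * 1000,                 # 10: stop
    ]
    memory[0:len(orders)] = orders
    memory[20:20 + len(data)] = data  # data occupy addresses 20 onward
    memory[31] = 1                    # the constant 1
    memory[32] = len(data)            # count of data remaining
    return memory


final = run(summing_program([5, 7, 9]))
print(final[30], final[1])  # total, and the ADD order after modification
```

After the run, address 30 holds the total and address 1 holds an order whose address part has advanced past the last datum, which is the sense in which one order has been changed into another.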

The electronic machines currently in use and under construction use 
the binary system of numbers for internal working. This necessitates a 
transformation of numbers from decimal representation to binary repre- 
sentation at the beginning and end of a calculation, but in many instances 
the machine itself can perform this operation. For details of this proce- 
dure, see Shaw (123), Hartree (55), Engineering Research Associates 
(35), Berkeley (10), Goldstine and Von Neumann (44), Wilkes (141),
MacMillan and Stark (87), and Andrews (5). 
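
For whole numbers the transformation itself is simple, as the following sketch suggests. It uses the familiar repeated-division rule for going from decimal to binary and the multiply-by-two rule for the return; the function names are illustrative only, and the routines actually used on the machines cited are not reproduced here.

```python
def to_binary(n):
    """Binary digits of a non-negative integer, by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        digits.append(str(remainder))  # remainders emerge low-order first
    return "".join(reversed(digits))


def to_decimal(bits):
    """Value of a string of binary digits, accumulated from the left."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return value


print(to_binary(19), to_decimal("10011"))
```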


Some of the machines currently in use are as follows: 


ACE (Automatic Computing Engine Pilot Model) 

APEXC (All Purpose Electronic X-ray Computer) 

ARC (Automatic Relay Calculator) 

BARK (The Swedish General Purpose Relay Computer) 
Binär Automatisk Relä Kalkylator

BINAC (USAF, Northrop Aircraft Incorporated, Binary Automatic
Computer)

BTL or BELL COMPUTERS (Bell Telephone Laboratories Relay 
Digital Computers, Models I to VI) 

CALDIC (University of California, Digital Computer) 

CPEC (IBM Card Programmed Electronic Computer) 

EDSAC (Electronic Delay Storage Automatic Computer, Cambridge
University)

EDVAC (Electronic Discrete Variable Automatic Computer)

FAIRCHILD SPECIALIZED DIGITAL COMPUTERS (Fairchild 
Engine and Airplane Corporation)

HARVARD (MARKS I to IV)

IAS DIGITAL COMPUTER (Institute for Advanced Study Computer; 
at one time called the MANIAC) 

THE INSTITUTE BLAISE PASCAL MACHINE 

IBM AUTOMATIC SEQUENCE CONTROLLED CALCULATOR 
(Harvard Mark I) 

IBM (Pluggable Sequence Relay Calculator; 5 built)

KALIN-BURKHART LOGICAL-TRUTH CALCULATOR 

MAGNETIC DRUM CONTROLLED COMPUTER (Constructed by and 
for the University of Illinois—like the ORDVAC) 

NAREC (Naval Research Laboratory Computer)

OMIBAC (General Electric Ordinal Memory Inspecting Binary Auto- 
matic Computer) 

ORDVAC (Similar to the IAS Digital Computer built for the Army 
by the University of Illinois) 
RAYTHEON DIGITAL COMPUTER (Small Binary-Octal Calculator)

SEAC (National Bureau of Standards Eastern Automatic Computer)

SEC (Simple Electronic Computer)

SSEC (IBM, Selective Sequence Electronic Calculator) 

SWAC (National Bureau of Standards, Western Automatic Computer) 

SIMON (A small scale, simple digital relay computer described by 
Berkeley and constructed by Porter, Jensen, and Vall) 

SWISS FEDERAL INSTITUTE OF TECHNOLOGY, INSTITUTE FOR 
APPLIED MATHEMATICS, SEQUENCE CONTROLLED COM- 
PUTER (similar to the Bell Computer Model 5) 

TELECOMMUNICATIONS RESEARCH ESTABLISHMENT COM- 
PUTER (General-purpose High-speed, Parallel, and Binary Ma- 
chine) 

UNIVAC (Bureau of the Census Computer) 

UNIVERSAL HIGH-SPEED DIGITAL COMPUTING MACHINE (At 
the University of Manchester) 

WHIRLWIND I (Massachusetts Institute of Technology) 


Analog Machines 


Interesting analog machine developments may be summarized as 
follows: 

Electronic differential analyzers of moderate accuracy, low cost, high 
speed, and flexibility are described by Macnee (88), Engineering Research 
Associates (35), and Hartree (54). Mechanical-electrical developments 
are described by Meyerott and Breit (97), Reid and Stromback (109), 
and Walker (134). 

An economic application is described by Morehouse, Strotz, and Horo- 
witz (100) in investigating economic dynamics of inventory oscillations. 
Paschkis (105) presented a thermodynamic application which bears an 
interesting parallel to research problems in education. Mathematical 
solutions for any but the most simple cases are complicated and so time- 
consuming as to make application of thermodynamic laws in industry 
impractical. If the various branches of applications are to be developed 
from their present state of being mainly an art and put on a more quanti- 
tative basis, this problem must be solved. A practical means of bridging 
the gap between theory and practice, he suggests, is provided by the 
analog computer. 

Some of the analog machines currently in use are the AEROCOM 
(Northwestern Institute of Technology), MADDIDA (Magnetic Drum 
Digital Differential Analyzer), the RCA Linear simultaneous equation 
solver, the MIT differential analyzer, the harmonic analyzers and synthe- 
sizers, the REAC differential analyzer, and the various electrical network 
analyzers. 

In educational research, many of the statistical problems handled by 
graphical computation could be better done by analog computers.
When war surplus gears, vacuum tubes, relays, and other components
become available again at a reasonable price, educational researchers 
should be able to build such devices. Hartree (55) built an analog
computer from standard Meccano parts and was able to obtain accuracy
comparable to the standard ten-inch slide rule. 


The Brain-Machine Analogy 


In their enthusiasm over new developments, several writers including 
Berkeley (10), Ashby (7), and Wiener (139, 140), have compared automatic
computers with brains. Wiener has perhaps been the most publicized. 
His application of the term “cybernetics” has given the matter an 
esoteric air. Less enthusiastic but more realistic comparisons have been 
made by Hartree (54), Engineering Research Associates (35), Jefferson 
(71), Ridenour (114), and Turing (131). 

McCulloch and Pfeiffer (92) compared operations of large scale
computers with known operations of the human brain. McCulloch (91)
used engineering terminology to show how the brain could be likened to 
a digital computing machine consisting of 10 billion relays whose per- 
formance was governed by inverse feedbacks. Subsidiary networks secured 
invariants or ideas, predictive filters enabled individuals to move toward 
the place where an object would be when they got there, and complicated 
servomechanisms enabled individuals to act. Disorders of functioning 
were described in terms of damage to structure, improper voltage of 
relays, and parasitic oscillations. Ridenour (114) indicated that the most 
complicated computer built so far has about 10 thousand elements, thus 
a factor of a million still separates machines from brains. McCulloch
(92) indicated that the most complicated machine built so far has
approximately the complexity of an earthworm’s nervous system. In only one
respect can the machine be considered superior. The individual elements 
of the machine operate about a thousand times faster than do the neurons 
of the central nervous system. 

In terms of space and power, the comparison is interesting. The brain 
dissipates less than 25 watts even when it is in full activity. The ENIAC, 
a million times less complicated, fills a large room and uses 120 kilowatts 
of power. Twenty kilowatts more are needed to run the blowers which 
keep it cool. 

Another feature which helped raise the brain myth with reference to 
these machines is their ability to handle logical relationships. Binary 
arithmetic is the arithmetic of logic in a universe where a proposition is 
either true or false. Logical operations in the same formal terms as are 
used in arithmetic manipulations can thus be conducted. The Kalin- 
Burkhart logical-truth machine is described by Berkeley (10). Shannon 
(121) developed an algebra of automatic switching circuits. Shannon 
(120) also developed plans for a chess-playing machine. Berkeley (11) 
related symbolic logic to the non-numerical reasoning operations of auto-
matic computers. McCallum and Smith (90) described a small logical 
computing machine consisting of seventeen relays and one stepping 
switch. Abbott (1) indicated the coding and programming required on 
the CALDIC for computation of a truth table of a sentence. 
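
The sense in which logical operations can be conducted in arithmetic terms may be seen in a short sketch. Truth values are taken as the binary digits 1 and 0, and each connective is written as an arithmetic expression on those digits. The sentence chosen, "if p and q, then r," is a hypothetical example, and the code makes no claim to reproduce the CALDIC coding described by Abbott.

```python
from itertools import product


def NOT(x):
    return 1 - x


def AND(x, y):
    return x * y


def OR(x, y):
    return x + y - x * y


def IMPLIES(x, y):
    return OR(NOT(x), y)


def truth_table(sentence, n_variables):
    """List every assignment of 0 and 1 together with the sentence's value."""
    return [
        (values, sentence(*values))
        for values in product((0, 1), repeat=n_variables)
    ]


def example_sentence(p, q, r):
    # Hypothetical sentence: "if p and q, then r"
    return IMPLIES(AND(p, q), r)


for values, result in truth_table(example_sentence, 3):
    print(values, result)
```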

Wiener (140) and McCulloch (91) remind us of the scientific impact
of improvements in communication technics. Macrae (89) discussed 
cybernetics and social science and drew parallels between the machines 
and the brain. Jefferson (71), Turing (131), and Stibitz (127) cautioned 
against establishment of superficial isomorphisms between the machine 
and the brain. Where properly qualified, the analogy between the machine
and the brain has been a useful aid to understanding of the workings of the
computer, but, as Hartree (54) points out, the machine cannot originate anything.
It must be told in great detail what and how to perform. It can only 
carry out a routine operation which has been programmed for it. 


Tables 


A comprehensive bibliography of tables is currently found in the 
“Recent Mathematical Tables” section of Mathematical Tables and Other 
Aids to Computation. Only one bibliography, Davis and Fisher (28), was 
published during the period reviewed and this bibliography is of doubtful
value to educational research. 

Useful statistical tables should include the Hartley and Pearson (52)
table for the t-distribution, and their (53) tables of the chi-square integral
and of the cumulative Poisson distribution. The Bureau of Standards (19)
binomial probability table gives individual terms of the expansion of 
the binomial and partial sums. An American third edition of the Fisher- 
Yates tables (37) was published. 
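
What such tables contain can be made concrete with a brief sketch which computes the individual terms of a binomial expansion, their partial sums, and a cumulative Poisson probability. The function names are illustrative and nothing here is taken from the cited tables themselves. It may be noted that the cumulative Poisson sum up to c for mean m equals the upper tail of the chi-square integral at 2m with 2(c + 1) degrees of freedom, which is why one set of tables can serve both purposes.

```python
import math


def binomial_terms(n, p):
    """Individual terms of the expansion of (q + p)^n, for k = 0, 1, ..., n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def partial_sums(terms):
    """Running totals of the terms, i.e. P(X <= k) for each k."""
    total, sums = 0.0, []
    for term in terms:
        total += term
        sums.append(total)
    return sums


def poisson_cumulative(c, m):
    """P(X <= c) for a Poisson variable with mean m."""
    term = math.exp(-m)  # the k = 0 term
    total = term
    for k in range(1, c + 1):
        term *= m / k    # each term from the one before
        total += term
    return total


print(binomial_terms(2, 0.5))
print(partial_sums(binomial_terms(2, 0.5)))
print(poisson_cumulative(1, 2.0))
```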

Du Mas (32) cited criticisms of other methods of testing the significance 
of rho and suggested a new method. Application was simplified by a 
table and an abac. Swineford (128) presented two tables designed to 
determine the smallest common N for each of two samples when testing
the hypothesis that the difference in two population parameters was at 
least D. She assumed that the sample proportions were distributed nor- 
mally and that the appropriate test was one-tailed. The Comrie (25) 
compendium of more than 50 tables is assembled with characteristic 
thoroness and meticulous care. 


Graphical Devices 


Adams (2) prepared an index of 1700 references to alignment charts 
appearing in about 100 technical journals. Charts are listed alphabetically 
under key words and categorically under 21 divisions.

The most interesting reference in this area is that of Mosteller and 
Tukey (101), which described the uses of binomial paper. This binomial
paper was designed to facilitate use of Fisher’s inverse sine
transformation for proportions, or to adjust binomially distributed data so
that the variance will not depend on the population value of the proportion,
p, but only on sample size, N. Most tests of counted data can be made quickly.
Almost no calculation is required. Twenty-two examples are given. 
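
The variance-stabilizing property of the inverse sine transformation can be checked numerically. The sketch below, offered only as an illustration and not as part of the Mosteller and Tukey procedure, computes the exact variance of a sample proportion and of its transformed value, arcsin of the square root of the proportion in radians, over the binomial distribution. For moderately large N the transformed variance stays close to 1/(4N) whatever the value of p.

```python
import math


def binomial_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def variance_of(values, weights):
    """Variance of a discrete variable with the given probabilities."""
    mean = sum(v * w for v, w in zip(values, weights))
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights))


def compare(n, p):
    """Return (variance of the raw proportion, variance of the angle)."""
    pmf = binomial_pmf(n, p)
    proportions = [k / n for k in range(n + 1)]
    angles = [math.asin(math.sqrt(r)) for r in proportions]
    return variance_of(proportions, pmf), variance_of(angles, pmf)


for p in (0.3, 0.5):
    raw, transformed = compare(100, p)
    print(p, raw, transformed, 1 / (4 * 100))
```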

Reynolds (112) presented a nomograph of an approximation formula 
for correcting interfunction correlation coefficients for heterogeneity. 
Jenkins (72) developed a short-cut graphical approximation to R which 
is said to have yielded a mean discrepancy of .005 with the Doolittle 
solution. Jenkins (73) also presented a single chart for tetrachoric r
designed as a substitute for the out-of-print Thurstone diagrams. Hamilton 
(48) developed a nomograph for obtaining tetrachoric correlation coeffi- 
cients. Results were generally accurate to three decimal places when the 
more uneven dichotomy was no greater than 70 and 30 percent. Goheen 
and Davidoff (43) provided a graphical method for rapid calculation of
the biserial and point biserial correlation. Their diagram is entered with 
the mean criterion score of the group passing the item and the propor- 
tion of correct answers to the item. It was accurate to the second decimal. 
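
The quantities with which such a diagram is entered are those of the usual point biserial formula, r = (Mp - Mt) / st times the square root of p/q, where Mp is the mean criterion score of those passing the item, Mt and st are the mean and standard deviation of all scores, and p is the proportion passing. The sketch below is an arithmetic illustration of that formula only, with made-up data; it is not the Goheen and Davidoff diagram. It also confirms that the result agrees with the product moment correlation between the item (scored 1 or 0) and the criterion.

```python
import math


def point_biserial(scores, passed):
    n = len(scores)
    mean_total = sum(scores) / n
    sd_total = math.sqrt(sum((s - mean_total) ** 2 for s in scores) / n)
    passing = [s for s, ok in zip(scores, passed) if ok]
    p = len(passing) / n
    mean_passing = sum(passing) / len(passing)
    return (mean_passing - mean_total) / sd_total * math.sqrt(p / (1 - p))


def product_moment(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: six criterion scores and pass (1) or fail (0) on one item.
scores = [2, 4, 5, 7, 8, 9]
passed = [0, 0, 1, 0, 1, 1]
print(point_biserial(scores, passed), product_moment(passed, scores))
```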

Orcutt’s (103) new regression analyzer consists of devices for obtain- 
ing sums of various cross-products of values fed into it on punched cards. 
The device should be useful for product moment and serial correlation. 
A digital electronic correlator was discussed by Singleton (124). 

Differences between percentages in terms of computing charts were
discussed by Lawshe and Baker (83), and Swineford (128). Hsü (60)
described a gadget which provided graphical determination of the standard 
error of a difference and of the correlation coefficient. 

A device for computation of the first four moments about the mean 
was described by Lyman and Marchetti (86). This consisted of a box 
containing a number of endless tapes arranged so that a selected line 
on each tape could be read thru a window in the box. A given distribu- 
tion was divided into equally spaced intervals. Each tape corresponded 
to an interval, x, and each line corresponded to a possible frequency, f,
for the interval. On this line, fx to fx⁴ were typed. Kahn and Suchman
(78) discussed scalogram boards for testing scalability of a series of 
items or questions answered by 100 individuals. 
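
The arithmetic which a device of the Lyman and Marchetti kind supports can be sketched as follows. The sums of fx, fx², fx³, and fx⁴ over all intervals are formed first, and the moments about the mean are then obtained from them by the standard conversion formulas. The function name and the data are illustrative assumptions; the sketch is not a description of the device itself.

```python
def grouped_moments(x_values, frequencies):
    """Mean and second to fourth moments about the mean for grouped data."""
    n = sum(frequencies)
    # Raw sums of f*x, f*x^2, f*x^3, and f*x^4 over all intervals.
    s1, s2, s3, s4 = (
        sum(f * x**power for x, f in zip(x_values, frequencies))
        for power in (1, 2, 3, 4)
    )
    mean = s1 / n
    m2 = s2 / n - mean**2
    m3 = s3 / n - 3 * mean * s2 / n + 2 * mean**3
    m4 = s4 / n - 4 * mean * s3 / n + 6 * mean**2 * s2 / n - 3 * mean**4
    return mean, m2, m3, m4


print(grouped_moments([-1, 0, 1], [1, 2, 1]))
```

The first moment about the mean is zero by definition, so the mean itself is returned in its place.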

Developments with reference to graphical devices were spotty. It 
appears likely that a little better acquaintance with computing mecha- 
nisms will divert much of the effort from graphical devices to mechanical 
and electronic gadgets. These could be made at relatively low cost, could 
compute more rapidly, and could be read more easily.


Punched-Card Technics


In the literature reviewed, considerable attention was devoted to various 
punched-card technics. General discussions and bibliographies of the 
methods were provided by Engineering Research Associates (35) and
Berkeley (10). The IBM bibliography (64) listed references dealing with 
punched-card applications in scientific research, statistics, and education. 
Perhaps the most noteworthy contributions in this respect were the 
various proceedings of the IBM seminars (65, 66, 67, 68). 

The following applications deserve mention: Welker’s (137) discussion 
of correlation and regression analysis; Krawitz’s (82) and Monroe’s (99) 
application of machine methods to analysis of variance and multiple 
regression; general applications by Southworth and Bachelder (125), 
and by Brandt (18). Industrial applications of interest to computers 
were discussed by Bell (9), Birnbaum (14), Curtiss (27), Donsker and 
Kac (31), Grosh and Usdin (46), Hazel and Lush (58), Kelly (79),
Lowe (84), Luckey (85), Metropolis and Ulam (95), Oakley and Kim- 
ball (102), and Verzuh (133). 

Castore and Dye (23) discussed a simplified method of determining 
sums of squares and of products, and Castore (22) suggested a method 
of grade-point prediction suitable for use with large groups. Kahn and 
Bodine (77) developed a means of Guttman scale analysis by IBM equip- 
ment which overcame the limitations inherent in the scalogram boards in 
terms of number of items and number of subjects used. 

In general there seemed to be a healthy development of punched-card 
technics, but these were largely directed toward industrial and technical 
applications. A seminar or symposium, patterned after the Endicott 
Conferences, would seem to be most beneficial for educational research.


Methods and Formulas


There were no general discussions or summaries of methods and for- 
mulas during the period surveyed. Scarborough (118) discussed numerical 
analysis particularly for engineering and technical scientific applications. 
Pease (107) prepared a manual on the use of desk calculators in elemen- 
tary statistics. The use of the Friden, Marchant, and Monroe machines 
was described. 

Procedures for the rapid calculation of standard deviations were sug- 
gested by Woolf (147), and a modification for rapid estimation of the 
standard deviation and for estimation of the square root of the sums of 
squares of differences was described by Marriage (93). 

Adkins (3) prepared a note on the computation of correlation coeffi- 
cients. Spurr (126) described a short-cut measure of product moment 
correlation which began with a scatter diagram and regression curve. 
Computation was done by measuring the ranges about the simple regres- 
sion curve and the range about the mean of the dependent variable that 
includes two-thirds of the items. The ratio of the two is approximately 
the coefficient of alienation which in turn may be used to enter a table 
for the value of r. Aldridge, Berry, and Davies (4) described a rapid
calculation of the slope of the linear regression line for the case in which
intervals of the independent variable were equally spaced. 
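
Why equal spacing shortens the work can be seen from a sketch of the standard shortcut, which is given here as an assumption about the general approach and not as the procedure of Aldridge, Berry, and Davies. With n equally spaced values of the independent variable, the sum of squared deviations of that variable is known in advance to be h²n(n² - 1)/12, where h is the spacing, so only one weighted sum of the dependent values need be formed.

```python
def slope_equal_spacing(y_values, spacing):
    """Least-squares slope when the x values are equally spaced."""
    n = len(y_values)
    midpoint = (n - 1) / 2
    # Weight each y by the position of its x relative to the middle x.
    weighted_sum = sum((i - midpoint) * y for i, y in enumerate(y_values))
    return 12 * weighted_sum / (spacing * n * (n * n - 1))


# y = 2x + 1 observed at x = 0, 2, 4, 6
print(slope_equal_spacing([1, 5, 9, 13], spacing=2))
```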

Carroll and Bennett (21) developed rapid routines for the computation 
of chi-square tests of goodness of fit and also of independence in both 
r × 2 and r × c tables. Jowett (76) presented a routine for calculation of
sums of squares and products on a desk machine. 
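
One well-known shortening of the chi-square test of independence avoids computing expected frequencies altogether: chi-square equals N times the quantity (sum over all cells of the squared cell frequency divided by the product of its row and column totals, less 1). The sketch below illustrates that identity with made-up frequencies; it is not claimed to be the routine of Carroll and Bennett.

```python
def chi_square_independence(table):
    """Chi-square and degrees of freedom for an r x c table of frequencies."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    ratio_sum = sum(
        cell * cell / (row_totals[i] * column_totals[j])
        for i, row in enumerate(table)
        for j, cell in enumerate(row)
    )
    statistic = n * (ratio_sum - 1)
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, degrees_of_freedom


print(chi_square_independence([[10, 20], [20, 10]]))
```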

Davis (29) indicated that traditional methods of calculating psychophysical
parameters from percentage data left much to be desired since
their sampling errors were not known. He suggested that if the regression 
of transformed proportions on the stimulus scale were calculated, the 
appropriate standard error would be the standard error of estimate. 
Workman and Adams (148) used a printed sheet to simplify computation 
of Vincent curves. 

Jordan (75) adapted a useful iteration scheme for solution of second 
and higher orders of an algebraic equation to the extraction of roots. 
Applied to the square root problem, the method yielded an answer correct 
to fourteen decimal places on the second iteration. 
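
The rapid convergence of iterative root extraction can be illustrated with the familiar Newton rule for square roots, in which each new estimate is the average of the old estimate and the number divided by it. This is substituted here as a simpler stand-in; Jordan's scheme is of higher order and is not reproduced, so the number of iterations needed will differ from that quoted above.

```python
import math


def newton_sqrt(a, guess, iterations):
    """Square root of a by repeated averaging of x and a / x."""
    x = guess
    for _ in range(iterations):
        x = (x + a / x) / 2
    return x


for k in range(1, 5):
    estimate = newton_sqrt(2.0, 1.5, k)
    print(k, estimate, abs(estimate - math.sqrt(2.0)))
```

Each iteration roughly doubles the number of correct figures, so a reasonable first guess is worth more than several extra iterations.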

Bancroft (8) listed formulas for obtaining probability values for 
common tests of hypotheses, and indicated the scope and utility of each. 
Worked examples, using formulas derived by the author, demonstrated 
how to obtain probability values outside the range of available tables. 
Burke (20) showed how the level of significance of any obtained F-ratio 
could be determined from tables of the Incomplete Beta Function. In this 
he duplicated a portion of what was discussed by Bancroft. 
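
The relation on which such a determination rests is that the probability of exceeding an observed F with n1 and n2 degrees of freedom equals the incomplete beta function ratio I_x(n2/2, n1/2) evaluated at x = n2 / (n2 + n1 F). The sketch below evaluates that ratio by a standard continued fraction in place of a table lookup. The function names and the numerical method are assumptions for illustration and are not taken from Burke or Bancroft.

```python
import math


def _continued_fraction(a, b, x, max_terms=200, tolerance=1e-14):
    """Continued fraction factor in the incomplete beta function ratio."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_terms + 1):
        terms = (
            m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
            -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)),
        )
        for term in terms:
            d = 1.0 + term * d
            c = 1.0 + term / c
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < tolerance:
            break
    return h


def incomplete_beta_ratio(a, b, x):
    """I_x(a, b), the incomplete beta function divided by the complete one."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + a * math.log(x) + b * math.log(1.0 - x)
    )
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _continued_fraction(a, b, x) / a
    return 1.0 - front * _continued_fraction(b, a, 1.0 - x) / b


def f_tail_probability(f_ratio, df_numerator, df_denominator):
    """Probability of an F-ratio at least as large as the one obtained."""
    x = df_denominator / (df_denominator + df_numerator * f_ratio)
    return incomplete_beta_ratio(df_denominator / 2, df_numerator / 2, x)


print(f_tail_probability(3.0, 2, 10))
```

With 2 degrees of freedom in the numerator the result can be checked by hand, since the tail probability then reduces to (1 + 2F/n2) raised to the power -n2/2.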

Rulon (117) described matrix representation models for the analysis of 
variance and co-variance. Tukey (129, 130), by using dyads and matrix 
representation extended the properties of the usual analysis of variance 
(for one dependent variable) to an average for several independent vari- 
ables. An example indicated the choice of a metric for the analysis in 
terms of the test of significance and estimation. Schmid (119) compared 
two procedures for calculating discriminant function coefficients. Johnson 
and Fay (74) illustrated a simplified working procedure for the Johnson- 
Neyman technic. 

Walsh (135) described tests of the population median having signifi- 
cance levels which are either exact or bounded under some very general 
conditions. These statistics were found very efficient for small samples 
from a normal population. They can be applied with little computation. 
An important use was described as a substitute for the corresponding 
t-tests, since order statistics were more easily applied, valid under more 
general conditions, and approximately as efficient as the corresponding 
t-tests. Freeman and Tukey (41) reported an empirical study of a number 
of approximations, some intended for significance and confidence work 
and some for variance stabilization. 

Fox (39) described the details and layout of a direct method of com- 
putation for the inversion of matrices. The theory is not new but is, in the 
opinion of the author, the best available for the average computer using 
ordinary desk calculating machines. Hartree (56) described a useful
iterative process. Green (45) showed that the weighted composite needed 
to combine tests for maximal reliability was the first principal axis of 
a matrix closely related to the correlation matrix. He suggested a simple 
and direct procedure for use instead of the cumbersome standard method 
of computing these weights. 
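
What a direct method of matrix inversion involves may be indicated by a generic elimination sketch. The matrix is written beside an identity matrix and reduced row by row until the identity appears on the left, at which point the inverse stands on the right. This is ordinary Gauss-Jordan elimination, assumed here for illustration; it is not claimed to be the layout which Fox describes. Exact fractions are used so that the check is free of rounding.

```python
from fractions import Fraction


def invert(matrix):
    """Inverse of a square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Write the matrix beside an identity matrix of the same order.
    work = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry.
        pivot = next((r for r in range(col, n) if work[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        work[col], work[pivot] = work[pivot], work[col]
        divisor = work[col][col]
        work[col] = [v / divisor for v in work[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col and work[r][col] != 0:
                factor = work[r][col]
                work[r] = [a - factor * b for a, b in zip(work[r], work[col])]
    return [row[n:] for row in work]


print(invert([[2, 1], [1, 1]]))
```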

Miller (98) discussed checking of tables by differencing. For this check 
to be fully satisfactory, it must be as nearly independent as possible of
the original computation. The location and detection of isolated blunders, 
the recognition of blunders made during differencing and the detection 
and location of coupled or multiple blunders were discussed. 
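
The principle of the differencing check can be shown with a small sketch. For a table of a smooth function the higher differences are small and regular, whereas an isolated blunder of size e spreads into the fourth differences as e times 1, -4, 6, -4, 1, the central value marking the faulty entry. The table of cubes and the blunder of 10 units below are made-up illustrations and are not drawn from Miller's paper.

```python
def differences(values, order):
    """Differences of the given order of a tabulated sequence."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def locate_blunder(values):
    """Index and estimated size of an isolated blunder, from 4th differences."""
    fourth = differences(values, 4)
    peak = max(range(len(fourth)), key=lambda j: abs(fourth[j]))
    # The central coefficient, 6, falls two entries beyond the start
    # of the span of five entries covered by the peak difference.
    return peak + 2, fourth[peak] / 6


table = [x**3 for x in range(10)]
table[5] += 10  # an isolated blunder in the sixth entry
print(differences(table, 4))
print(locate_blunder(table))
```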

Developments contributed by educational statisticians were aimed at 
systematizing and simplifying the arithmetic labor involved in various 
computations. Developments contributed by mathematical statisticians 
were aimed at reducing the amount of arithmetic computation required 
for making frequently used tests of significance. 


Social Impact and Future of These Developments 


In the none too distant future it is likely that a substantial group of 
universities and social science research organizations will have access to 
automatic computing equipment. The IBM Corporation is mass producing
a group of such computers on a trial basis. While these are all being 
assigned to high priority agencies at the present time, it should not be 
too long before they become more generally available. 

The electronic digital computers will be useful in the investigation of 
problems involving multivariate analysis and studies involving slowly 
converging series. Analog computers could be constructed at any time 
the cost of parts begins to fit educational research budgets. Such analog 
computers could be set up to give almost instantaneous values for the 
various routine statistical equations and fitted curves in present use. Much 
of the work done by graphs, tables, and charts also could be better done 
by simple analog computers. 

To be able to use these devices effectively, it is necessary to develop new 
ways of thinking about educational research problems. There is a need 
to consider the study of many factors simultaneously and the possibilities 
of the use of long-term data. Mathematical models developed for such 
research should be more than rehearsals of current multivariate analysis 
procedures. It is to be hoped, so far as educational research is concerned, 
that these basic developments can keep abreast of the technical advances. 
The potential is here, but adapting these new technics to research remains. 
CHAPTER VII


Observational Procedures Used in Research

SAUL B. SELLS and ROBERT W. ELLIS 

“Observational Methods” has been treated in the two previous reviews
under this title (84, 85) as a broad field including all research technics
employing observation, both direct and indirect, in collection of data.
Thus considered, it embraces various methods of direct observation includ- 
ing instrumental aids; methods utilizing ratings, interviews, case histories 
and biography; questionnaires and surveys; as well as indirect technics 
of analyzing writings, artistic productions, and other behavioral products. 
Altho these methods are integral to many other topical areas of educational
and psychological research, the emphasis in this review is on their
usefulness as observational technics. 

The research literature of the past three years reflects an increased 
concern with method. Perhaps the most important new contributions are 
those which attack the description of group structure, group dynamics, 
and interaction. From the standpoint of content the research of this period 
shows an extensive interest in problems of social and individual adjust- 
ment. The refinement of observational technics in the approaches to these 
problems indicates much scientific progress. 

Of general interest as a valuable symposium on research methods is 
Andrews’ volume on the methods of psychology (7). This book covers a 
wide area of problems; each section discusses theoretical issues and pre- 
sents illustrative solutions by reference to outstanding published research. 
Solomon (92) and Stouffer (96), both dealing with problems of experi- 
mental design, have contributed to the literature of research methodology. 
Solomon’s observations concerning control groups are applicable to numerous
problems of educational evaluation. Stouffer offers some practical
scientific advice on study planning. 

Shannon (87) analyzed the personal characteristics of 1242 contributors 
to educational research literature. The tradition of expecting only professors
and research directors to publish research was supported. Few
publications came from women, elementary- and high-school teachers, or
older persons as groups.





Direct Recording of Behavior 


Several noteworthy contributions illustrate the rich possibilities of 
making direct records of behavior which can be reduced to quantitative 
data for analysis. The remarks of the senior reviewer in the previous 
review (85) concerning reliability and standardization of procedure 
should be repeated. Nevertheless, the following references illustrate good 
observational technic. Ames (3), studying interpersonal smiling responses
in the preschool years, recorded 150 observation periods over a two-year 
interval. The occurrence of spontaneous smiling and laughing behavior 
of children from 18 months to four years of age was recorded as they 
took part in regular nursery-school play activities. She found that the 
amount of smiling increases from one smile per child every 6 minutes at 
18 months to one every 144 minutes at four years. The ratio of laughs 
to smiles increases from 1 to 10 at 18 months to 1 to 3 at four years. 
Ames (4) also demonstrated the value of motion picture recording in 
studying developmental trends in writing and block-building behavior 
from 36 weeks to 10 years of age. Her study of 179 cases over this age span 
revealed clearly observable trends in the place of writing and in position 
of writing and nonwriting hand, which were consistent from age to age 
and from child to child. Analysis of postural sets showed that these tend 
to vary with the maturity of the action system. 

Clark (23) observed the spontaneous remarks of 12 nursery-school 
children during a four-week period, in which simple number terms were 
used conversationally, often without comprehension of their meaning. 
She noted that the nursery school, by providing concrete experience with 
time, space, size, and number concepts, can stimulate and extend their 
use and comprehension. 

Jones and Bayley (43) made use of skeletal X-rays in studying physical 
maturing of boys in relation to behavior. The “physical ages” of 90 boys 
were determined by skeletal X-ray. Contrasting extreme groups, 16 in 
each, were selected for a study of early and late maturation. Observa- 
tional data indicated that the early maturation group was rated more 
physically attractive, neater, less animated, less “affected,” less unin- 
hibited and more relaxed. Their contemporaries tended to classify early 
maturing boys as less attention-seeking, less restless, more assured, less 
talkative, more “grown-up” and more likely to have older friends. These 
data suggest that psychological characteristics are related to rate of 
physical maturing, when extreme groups are compared. 

Bossard and Sanger (14) published an interesting account of the 
observations, for a period of seven weeks, of the behavior and changes 
induced in a seven-year-old girl who moved suddenly, in company of her 
mother, from a small city apartment to an elegant country estate. The 
child, previously obedient and content with solitary play, tried to live up 
to the new environment. The sudden changes in residence and scale of 
living, with few other changes in living circumstances, were followed by 
six discernible behavior changes, as noted constantly in the detailed daily 
notes recorded independently by two participant observers. These included 
(a) marked increase in sense of possession and responsibility for posses- 
sions, (b) change in scale of personal behavior to conform to the new 
living conditions, (c) preoccupation with means of acquiring a similar 
setting, even to the point of suggesting that the mother do so thru a new 
marriage, (d) feelings of uncertainty and insecurity to the point of fear
of being lost or fear of losing the mother, (e) feelings of isolation because
of lack of constant contact, yet “hugging” the new isolation, (f) marked 
increase in verbalization. 

Continuing his studies of infant speech, Irwin (42) recorded speech
sound data for infants who were only children and for those with older siblings. Analysis
of results in terms of phoneme type and phoneme frequency indicated 
that the presence of older siblings in the family has negligible effect on 
infants’ speech sound development. Fairbanks and his co-workers (29, 30) 
used phonophotography and sound frequency measurements in acoustical 
studies of vocal pitch in seven- and eight-year-old boys and girls, who 
read test passages which were recorded. Voice breaks were found in 
preadolescent girls as well as boys. Characteristic sex differences in pitch 
level were consistent. 


Studies Based on Recorded Products of Behavior 


This section is concerned with indirect observational technic. Writings, 
recorded speech, artistic productions and other products of behavior 
constitute the raw data of these approaches. 

Kaplan and Goldsen (44) studied the reliability of content analysis 
categories for the semantical analysis of documentary material. In three 
studies, using 500 newspaper headlines classified independently by eight 
analysts, judgments of strength, morality, and related concepts were 
made with high interjudge agreement. 

Watson, Breed and Posman (103) recorded 1001 remarks overheard 
in conversations among the population of Manhattan. These they classified
as to personal frame of reference (e.g., about self, other people, both,
etc.), by topical reference (e.g., economic, political, recreational, etc.) 
and by place of conversation. Sex, age, and class differences were observed 
in relation to culturally assigned roles. 

Withall (107) developed a seven-category scale for analyzing teachers’ 
statements to pupils as a means of measuring the “social-emotional” 
climate in classrooms. The mean percent of agreement for four judges 
with the investigator in analyzing three typescript records of teachers’ 
statements to pupils was 65. With the aid of a mechanical device pupils 
indicated feelings of “good” or “bad” which indicated more positive 
emotional reactions in the learner-centered than in the teacher-centered 
instruction sessions. 
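
The agreement figure reported above can be illustrated with a short computation. The following is a minimal sketch only: it assumes that "percent of agreement" means the share of statements a judge placed in the same category as the investigator, averaged over judges. The category codes, the data, and the function names are hypothetical and are not taken from Withall's report.

```python
def percent_agreement(judge_codes, reference_codes):
    """Percent of statements a judge placed in the same category as the reference coder."""
    if len(judge_codes) != len(reference_codes):
        raise ValueError("both coders must classify the same statements")
    matches = sum(1 for j, r in zip(judge_codes, reference_codes) if j == r)
    return 100.0 * matches / len(reference_codes)


def mean_percent_agreement(judges, reference_codes):
    """Average the judge-by-judge agreement with the reference coder."""
    scores = [percent_agreement(codes, reference_codes) for codes in judges]
    return sum(scores) / len(scores)


# Hypothetical data: category numbers (1-7) assigned to ten teacher statements.
investigator = [1, 2, 2, 3, 5, 7, 4, 4, 6, 1]
judges = [
    [1, 2, 3, 3, 5, 7, 4, 2, 6, 1],  # 8 of 10 match the investigator
    [1, 2, 2, 3, 5, 6, 4, 4, 6, 2],  # 8 of 10 match
    [2, 2, 2, 3, 4, 7, 1, 4, 5, 1],  # 6 of 10 match
    [1, 1, 2, 3, 5, 7, 4, 4, 6, 1],  # 9 of 10 match
]
print(mean_percent_agreement(judges, investigator))
```

Simple percent agreement of this kind makes no allowance for agreement expected by chance, which is one reason a figure such as 65 is hard to interpret without knowing how often each category was used.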

Luft and Wheeler (48) made both a qualitative and quantitative analysis 
of a random sample of 339 spontaneously written reader letters to John 
Hersey, all dated within two weeks of the publication of his article, 
“Hiroshima,” in the New Yorker magazine. The advantages and disad- 
vantages of personal documents of this type as indicators of public opinion 
are discussed. The report tabulated correspondents’ approval or dis- 
approval of the article as a whole. It also analyzed the concepts expressed. 
Using a similar approach on a quite different problem, Pixley and Beekman
(68) analyzed about 7000 anonymous essays on religion written
by students in the public schools of Los Angeles. Of 3676 students writing 
on church attendance, 36 percent attended regularly, 52 percent irregu- 
larly and 12 percent never. Of 3317 writing on prayer, 22 percent prayed 
for personal benefits, 19 percent to express thanks, 15 percent to talk to 
God, 11 percent for guidance, 10 percent by habit and 9 percent for 
mental comfort. 

Several researchers have used personal documents. Parry (65) used excerpts
from autobiographies of outstanding college juniors as source material in 
investigating childhood school practices and personalities. Nagy (62), 
using written compositions, drawings, and conversational records, a total 
of 484 protocols from 378 children from ages three to 10, found three 
stages of development of the meaning of death, with transitional points 
at about age five and again at age nine. Swenson and Caldwell (97, 98) 
studied developmental trends in content categories and in formal characteristics,
such as spelling, in an analysis of 680 letters written by public-school
children in Grades IV to VIII. Windsor (106) studied the easel
painting produced over a period of one year by 149 nursery-school chil- 
dren, ages two and one-half to five and one-half, from a wide range of 
social classes. Detailed case histories from parents, discussions with 
teachers and complete behavior records for two days each week were also 
collected. Paintings were analyzed for style and matched with behavior 
traits described by teachers. Windsor concluded that easel painting, used 
as a “projective technic” by trained and experienced teachers, is a valu- 
able source of information about intimate child interests and problems. 

Flanders (34), to explore the relationship between verbalization and 
learning, recorded the statements of 22 pupils in seventh-grade arithmetic
during 17 one-hour periods of classroom instruction. Quantity of verbal- 
ization correlated .72 with two learning tests used as criteria. 


Controlled Diary Technic 


Anastasi (5) compared three published studies of anger responses of 
college women, using the controlled diary technic. She recommended 
methodological improvements of the technic, such as specifying the 
amount of detail required in reporting each entry and control of the 
temporal sequence of reporting by different subjects. To avoid artifactual 
effects of sequential factors, she suggested that different subjects might 
begin their diary records on different days of the week. Anastasi, Cohen, 
and Spatz (6) demonstrated this technic in a study of fear and anger. 


Case History Technic 


Henry (39) analyzed case histories from a child guidance clinic as 
tho they were anthropological records from a primitive culture. His
method of systematic, line by line extraction from each sentence, of the
appropriate cultural generalization implied, brought out interesting in- 
sights into the cultural formation of symptoms and therapy. Milner (56) 
compared case studies of 15 boys and 15 girls, ages 10 to 14, to test the 
hypothesis that early adolescents of lower-middle and upper-lower classes 
manifest certain sex-typical and group-typical personality characteristics. 
Her analysis of the case material revealed 11 group-typical characteristics, 
15 interrelated characteristics typical of girls and 13 of boys. 


Observation of Group Behavior 


Increasing importance is being attributed to group structure, dynamics, 
and interaction as problems challenging psychology, education, and 
sociology. New methods are being developed to attack these problems. 
Carr’s (18) “situational analysis” approach to sociology emphasized the 
need for the observer to detach his own background experiences from 
the phenomena observed. The same problem has long been recognized in 
psychological research. Carr’s manual is useful in its emphasis on the 
situation. Bales (8) presented a method for the study of small groups 
which he called “interaction process analysis.” His method of classifying 
direct, face-to-face interaction as it takes place and of summarizing and 
analyzing recorded data required specially trained observers, one-way 
vision mirrors, and an interaction recorder. Observed items of behavior 
were recorded according to 12 descriptive categories. The record pro- 
vided for frequency by category, time, and sequence and could be related 
to particular group action in the situation. 

Zander (109) developed an objective method for collecting data to 
make a case study of a group. His instrument consisted of a checklist of 
30 items designed for use at five-minute intervals. It covered such items 
as nature of group, setting, adult-child interactions, style of child partici- 
pation, pattern of group transitions, transmission of ideology, and others. 
Zander reported satisfactory agreement between the independent ratings 
of two trained observers. 

Soale (91) studied sources of friction among people living in rural 
communities. A list of 209 common rural problems was rated for serious- 
ness and frequency by a group of persons considered competent to 
appraise situations involving friction among people. The results yielded 
a list of sources of conflict useful in social studies classes. 

Merton (54) studied patterns of interpersonal influence and communi- 
cation behavior in a community of 11,000 people. He interviewed 86 
persons to identify the individuals influential in the community. He then
interviewed the 30 most frequently mentioned of 379 names obtained. 
Preliminary classification of interview data according to “dynamic posi- 
tion in the local influence-structure” proved unproductive. However, 
classification of cases into “local” and “cosmopolitan” types appeared 


December 1951 OBSERVATIONAL PROCEDURES 





more useful and led to description of their roots in the town, sociability, 
organizational behavior, routes by which they gained positions of influence 
and differences in interpersonal influence exerted. 


Sociometric Methods 


Among observational methods of research, sociometry has won a place 
of importance. Moreno (61), its founder, discussed experimental soci- 
ometry and the experimental method in science. This is a general dis- 
cussion of methodology. Festinger (32) has contributed to the power of 
sociometric analysis. The usual way of analyzing sociometric data is to 
graph them and determine interrelationships by inspection. This proce- 
dure is seriously limited as a function of the number of individuals and 
the number of choices. Festinger showed how matrix algebra offers a more 
efficient analytic procedure. Pepinsky (67) in an important theoretical 
paper, discussed the meaning of “validity” and “reliability” as applied 
to sociometric tests. These psychometric concepts do not apply to soci- 
ometric observations which employ direct measurement of the choice 
behavior studied. A systematic development of new concepts is suggested. 
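
The matrix treatment of sociometric choices mentioned above can be sketched briefly. This is an illustration under stated assumptions, not a reproduction of Festinger's procedure: choices are coded in a square matrix C with C[i][j] = 1 when person i chooses person j, the five-person data are hypothetical, and the helper names are invented for the example. Squaring C counts two-step chains of choice, and cubing the matrix of mutual choices shows, on its diagonal, which persons belong to at least one trio of mutually choosing members.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mutual_choices(c):
    """Keep only reciprocated choices: entry is 1 when i chose j and j chose i."""
    n = len(c)
    return [[c[i][j] * c[j][i] for j in range(n)] for i in range(n)]


# Hypothetical choice matrix for five persons: C[i][j] = 1 if person i chose person j.
C = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

# Column sums give the number of choices each person received.
choices_received = [sum(row[j] for row in C) for j in range(len(C))]

# two_step[i][j] counts chains of the form i -> k -> j.
two_step = matmul(C, C)

# A positive diagonal entry in the cube of the mutual-choice matrix means the
# person lies on a closed chain of three mutual choices (a mutually choosing trio).
S = mutual_choices(C)
S3 = matmul(matmul(S, S), S)
clique_members = [i for i in range(len(S)) if S3[i][i] > 0]

print(choices_received, clique_members)
```

With a handful of persons the same facts can be read from a sociogram; the advantage of the matrix form is that the arithmetic does not become harder to follow as the number of individuals and choices grows.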

Boorman and Springer (13) presented a psychodrama protocol of a 
student-teacher evaluating the role of principal. Their analysis showed 
how this prospective teacher generalized and stereotyped his reactions to 
the role of principal on the basis of abstractions made while job-hunting. 
Hyde and York (41), using the sociogram as a model, presented a technic 
for investigating interpersonal relationships in a mental hospital. Temporal
sequence and type of interpersonal relations expressed in speech,
movement, and attention were recorded with graphic symbols and number
comments.


Rating Technics 


Powell (69) compared three methods of rating of personality adjust- 
ment: self-rating, peer-rating, and expert-rating. One hundred and forty 
female college students rated themselves on the Bernreuter Personality 
Inventory; experts’ ratings were made on a seven-point adjustment rating 
scale by dormitory counselors; peer ratings were obtained by a “guess- 
who” test and a sociometric test. Intercorrelations were positive, but very 
low. The author concluded that self-diagnosis is inadequate for identifi- 
cation of individuals requiring therapy. 

Wherry and Fryer (104) undertook a study to determine whether 
“buddy ratings” are measures of leadership or merely of popularity. 
Using two classes of officer candidates numbering 82 and 52, they found 
thru factor analyses that buddy ratings made during the first month 
measured the same factors three months later. When superiors’ ratings 
were obtained at the fourth month, they reflected the leadership factor
which fellow students identified in the first monthly rating. 

Rotter and Wickens (81) investigated the consistency and generality 
of ratings of “social aggressiveness” based on observations of role-playing 
situations. They found that large discrepancies between raters appeared 
when the raters had no explicit definition of the trait to be rated or the 
kind of behavior to be observed in a given role. Training increased 
interrater agreement. 

Fiske (33) compared the consistency of factorial structures of per- 
sonality ratings of 128 Veterans Administration trainees in clinical psy- 
chology on 22 scales from three sources: (a) by three psychologists, 
(b) by three teammates in an assessment program, (c) by self. Four 
similar factors were found in all three sets of ratings. 

Sisson (90) studied the efficiency of the forced choice rating used by 
the United States Army. From a group of four behavior items, two favorable
and two unfavorable, the rater selects the items which best and least
describe the ratee. Results based on 50,000 officer-effectiveness ratings reveal that
the forced choice method is superior to a linear scale for these reasons: 
(a) It produces a distribution of ratings relatively free from the usual 
pile-up at the top of the scale. (b) It is less influenced by the rank of 
the officer being rated. (c) The ratings are valid indexes of the ratee’s 
real worth. 

Representative studies using rating technics are numerous. Of interest 
are those of Turner (101) who developed a scale of altruism which was 
used in the Cambridge-Somerville Youth Study; de Groat and Thompson 
(26) who investigated teacher approval and disapproval among sixth- 
grade pupils; Christensen (22) on courtship conduct of 1385 unmarried 
Mormon university students; and Schwebel and Asch (83), who analyzed 
student evaluations of nondirective teaching methods in relation to student 
adjustment. 


Attitude Studies 


Using a scale of attitude toward Negroes administered to 183 social 
science students, Kriedt and Clark (45) compared Guttman’s Cornell 
Technic of Scale Analysis with two older item analysis methods. They 
concluded that scale analysis can be very useful provided discretion is 
exercised in selection of suitable problems and handling of methods. 
Lyman (49), using two forms of his School Attitude Inventory compared 
scrambled versus blocked (according to scale) arrangements of items. 
His results indicated that scrambling was not, as frequently contended, 
superior to simple block arrangement. 

Stendler (93) made use of stories describing acts of stealing from a 
private person and a corporation in studying eighth- and ninth-grade 
pupils’ socio-moral judgments. This represents a valuable indirect method
of approaching such problems. 
Several studies were concerned with the relation between information
and attitudes. Woodruff and DiVesta (108), investigating students’ atti- 
tudes toward abolition of fraternities and sororities, concluded that 
changes in attitudes can be brought about by changing the individual’s 
concept of the object toward which the attitude is expressed. Gleason 
(36), in a study of the relation of information to attitude toward the 
Taft-Hartley Law among industrial employees, found no difference in 
knowledge of the Act between the group opposed and that in favor. How- 
ever, the no-opinion group had less knowledge than the other two. The 
presentation of facts did not readily change beliefs. Shimberg’s (89) 
study of information and attitude toward world affairs found results in 
essential agreement with Gleason. 

Dietsch and Gurnee (27) sent leaflets against subsidization of college 
athletics at weekly intervals for five weeks to 427 college students. They
found that the group receiving five leaflets dropped from 49 percent “absolutely
yes” on a ballot on subsidization to 17.2 percent, whereas a control
group dropped to 42.5 percent. The use of two additional experimental
groups showed that one leaflet was as effective as three or five. Rubin- 
Rabson (82) used the Thurstone-Wang Scale for Attitude Toward Treat- 
ment of Criminals, at the beginning and end of a series of eight popular 
lectures on personality dynamics, to measure the effect of the lectures on 
this attitude. No control group was reported. She reported a significant 
shift from preference for punishment to preference for re-education. 

Cahalan and Trager (16) found that an 11-item test for anti-Semitism 
gave results different from those obtained with a free-answer question. 
The free-answer question elicited a variety of stereotypes. This question 
of method in studying attitudes is sometimes baffling and requires careful 
pretesting of inquiry forms. 

Radke and Trager (74), adapting current projective technics, presented 
colored dolls to a carefully selected sample of Negro and white primary 
pupils in Philadelphia in a study of children’s comprehension and inter- 
pretation of the social roles of Negroes and whites. Both groups, 30 per- 
cent of whites and 16 of Negroes, ascribed inferior roles to the Negro 
characters. Zeligs (110) compared sixth-grade children’s responses to 
Dutch, French, Italian, Mexican, Russian, and Negro stereotypes in 1931 
with data on the same schedule in 1944. Responses of children reflected 
the culture patterns of their social environment. Razran (75) presented 
30 face photos of college girls to a stratified sample of 150 males, who
rated them for liking, beauty, intelligence, ambition, and entertainingness.
Two months later he repeated the test, this time assigning a racial tag to 
each picture. A number of significant changes were noted. MacKenzie (50) 
investigated the effect of contact in determining attitude toward Negroes, 
using a data sheet and an 11-item attitude scale. Acquaintance with
Negroes of high occupational status and variety of contacts with Negroes 
were associated with favorable attitude. The latter was most reliably 
indicated by the willingness to eat at the same table with Negroes. Crossen
(25) used an attitude test to select groups of ninth-grade pupils favorable
and unfavorable to Negroes. She then administered a critical reading test
to both groups and found that the unfavorable attitude group made significantly
lower critical reading scores than the other. When a similar
procedure was followed with respect to attitude toward Germans, comparable results
were not found. One may ask whether the intensities of attitude measured
thru the range of one attitude scale can be related quantitatively to those
of another.

Lindzey and Rogolsky (47) also used photographs, taken from a college 
yearbook, to investigate racial attitudes. They found attitude toward 
minority groups associated with success in identifying photographs. They
also found a correlation of .55 between anti-Jewish, anti-Catholic, and anti-Negro
attitudes, which they interpreted as a general factor of bigotry among 
prejudiced people. 

Farnsworth (31) developed five rating scales for studying musical 
interests: general interest in music, “serious,” “popular,” “hit parade,” 
and “waltz.” His intercorrelations of these scales reveal a number of 
interesting relationships. Analysis of sex differences shows girls more 
interested than boys in both “serious” and “popular” music. Chase (20) 
reported a survey of subject preferences of fifth-grade pupils, based on a 
sample of 13,483 in 65 New England towns and 2350 in a southwestern 
city, using a subject preference checklist. Reading is highest for the total 
group. Rankings of subject preference and sex differences are reported. 


Survey and Opinion Research Technics 


Altho this field has become standardized and gained much prestige in 
industry and politics, there are still many unresolved problems and uncon- 
trolled sources of error. The following reports were selected on the basis 
of their value as significant methodological contributions. Newhouse and 
Kilpatrick (64), at the University of Washington, compared the technic 
of polling student opinion by telephone with a face-to-face survey. The 
more economical telephone survey compared favorably in results as well 
as with respect to student cooperation and administrative acceptance. 
Stephan (95) presented a definitive and systematic discussion of sampling 
design which is applicable to studies of social organization and institu- 
tional behavior as well as to opinion and marketing research. Parten (66) 
developed a textbook manual outlining procedures used by surveyors in 
marketing, opinion research, census taking, audience testing and social 
research surveys. The manual covered the history and problems involved 
in constructing questionnaires, drawing samples, interviewing, coding, 
tabulating and preparing reports. Blumer (12) discussed six salient 
features of public opinion and developed the issues of sampling. His 
thesis was based on the need for recognizing the members of a population 
sample as interacting parts of a social organization rather than as an 
aggregation of individuals having equal weight. Riesman and Glazer (78)
developed a related argument in their paper; namely, that polls should 
get at latent meanings of responses thru analysis of the total personality 
of respondents. Wallin (102) investigated volunteer subjects as a source 
of sampling bias. In a study of engaged couples he obtained data on a 
number of characteristics on those who failed to participate as well as 
on participants. Bias was found on only one item and this did not 
materially affect the results. Manfield (51) recorded the pattern of 
responses to Veterans Administration mail surveys. Approximately half 
of the returns came in by the third day, three-quarters by the fifth day, 
nine-tenths by the tenth day. He presented control and tabulation procedures
to secure comparability in the data.


Interview Studies 


Lauck (46) used an interview approach to study the etiology of delin- 
quency. She interviewed 100 delinquent adults and a control sample of 
100 nondelinquents to obtain background information on school behavior 
record, guidance experience in schools, educational achievement, and 
present job satisfaction. This method appears to be valuable in discover- 
ing hypotheses in that it provides more detailed and contextual data than 
can be obtained by questionnaire technics. On the other hand, it is 
restricted in applicability to large samples, which may be needed for 
verification of hypotheses. Adorno and others (2) used interviews to 
throw light on parental relations, childhood experience, conception of 
self and the dynamics and organization of personality. Their book was 
based on intensive interviews with 63 children; the protocols were 
analyzed qualitatively for prejudice, political and economic ideas and 
ideology, and personality syndromes. 

Sherriffs (88) demonstrated how a personal interview by a counselor 
can be utilized to reduce tensions, direct satisfaction of needs, provide 
individualization of student treatment and improve academic achievement. 

Turner (100) used an open-end type of interview with 200 heads of 
families in studying factors associated with migration to Kalamazoo, 
Michigan, “a medium-sized American city.” 


Questionnaire Technics 


Romine (80) listed 12 criteria for improvement of questionnaires. 
They cover psychometric as well as content considerations, and should 
constitute a manual of basic principles in questionnaire construction. 
Of equal importance for general consideration of questionnaire studies 
is Shannon’s (86) analysis of percent of returns to questionnaires in 
reputable educational research. The mean percent for 433 questionnaires 
was 72. 
Gerberich and Mason (35) compared signed versus unsigned questionnaires
using a 40-item questionnaire concerning previous training in
biological sciences, study habits, and reactions to a current biology 
course. Their sample consisted of 2876 students divided approximately 
equally between signed and unsigned groups. The results did not reveal 
any significant differences. Campbell and Mohr (17) investigated the 
effect of ordinal position on responses to items in a checklist. They used 
a Latin Square design, with 16 checklists for 16 types of radio program, 
each type appearing once in each form and once in each ordinal position. 
Sixteen groups of 40 students each filled in each form by checking their 
five favorite types of radio program. No significant differences were found 
attributable to ordinal position, altho there were consistent differences 
among types. 
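
The Latin Square arrangement described above, in which each of 16 program types appears once in each form and once in each ordinal position, can be sketched as follows. This is a minimal illustration assuming a simple cyclic square; the report as summarized here does not say which particular square Campbell and Mohr used, and the numbering of program types and the function names are hypothetical.

```python
def cyclic_latin_square(n):
    """Return an n x n Latin square: row f is form f, column p is ordinal position p."""
    return [[(form + position) % n for position in range(n)] for form in range(n)]


def is_latin_square(square):
    """Check that every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


# 16 program types (numbered 0-15, hypothetical labels) arranged over 16 checklist forms.
forms = cyclic_latin_square(16)
print(forms[0])  # order of program types on the first form
print(forms[1])  # the second form begins one type later
print(is_latin_square(forms))
```

A cyclic square balances ordinal position but not sequence: each type is always preceded by the same neighbor. Where carry-over from the preceding item is a concern, a randomly chosen or sequence-balanced square would be the safer choice.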

Roeber (79) analyzed seven well-known interest inventories for word 
difficulty using the Thorndike-Lorge word frequency list. Each contained 
at least 10 percent of words beyond the ninth-grade level. Two had 20 
percent above this level. It would be interesting to follow this with 
empirical studies of comprehension of the items by samples of children 
at successive grades. 


Questionnaire Studies 


In an investigation of vocational education in Negro high schools, 
Bryant (15) sent a questionnaire to urban and rural high schools in 
Texas and to 6000 Negro workers in 15 selected occupational fields. 
His findings indicated a need for greater emphasis on vocational educa- 
tion thru in-school and on-the-job training. Michal-Smith (55) admin- 
istered a questionnaire on educational career plans to 302 veteran college 
students. Of 77 percent who replied, one-third had not originally planned 
on going to college, one-fourth had no particular vocational objective 
and one-third had changed their objectives. Aaronson (1) sent 3030 
questionnaires to veterans who dropped out of school in 1946 and 1947. 
In order of frequency, those replying complained that (a) they could not 
obtain courses they desired, (b) they were unhappy about their counsel- 
ing, (c) they were unable to get into the “swing” of academic life. 

Nance (63) administered three well-known personality inventories to 
102 teachers-college sophomores. On all three tests the group mean 
exceeded the published norms for “masculinity-femininity.” Since these 
results, properly interpreted with respect to the significance of a “mas- 
culinity-femininity” score in relation to teaching, counseling, medicine, 
and various other professions requiring tenderness and considerateness 
in relation to persons in dependent roles, are very significant, they should 
stimulate further research in this area. Best (9) and Richey and Fox (77) 
investigated factors associated with selection of teaching as a profession. 
These studies should be of interest with reference to recruitment and 
selection of teachers. Wimmer (105) used a questionnaire, sent to 447
secondary schools, in a survey of guidance practices. McKee (53) analyzed 
the elementary-school principalship on the basis of 52 questionnaires 
sent to principals. 

Stendler (94) administered a 25-item, free-response questionnaire on 
child behavior to a group of elementary-school teachers. She evaluated 
their replies in relation to those of three mental hygienists and classified 
the data according to types of solutions proposed for specific behavior 
problems. “Talking to the child,” or moralizing, was found to be the 
favorite treatment. This study demonstrates how a questionnaire may be 
utilized as an indicator of teachers’ insight into child behavior. It also 
confirms the widely appreciated fact that too few teachers appreciate the 
need to discover deeper causes underlying the manifest behavior of 
children. Billig (10) reported two questionnaire studies of interpersonal 
relations, one among teachers, the other between teachers and pupils. 
One of his most interesting findings, reflecting a dominant trend in our 
culture, is that pupil competition for dominance is direct and often 
undisguised. 

Guilford and Comrey (37) developed a multiple-choice 150-item 
biographical inventory to predict proficiency of school administrative 
personnel. The authors concluded that their negative results were attrib- 
utable to the questionable criterion, superintendent’s ratings, and that 
the biographical inventory has limited use in the selection of school 
administrators. 


Observational Studies of Adjustment 


Mitchell (57) gave a “guess who” test to 873 sixth-grade pupils and 
had behavior rating forms checked by teachers and parents, for the 
purpose of studying the relation of reading to social acceptability. She 
found that appraisals of social acceptability made by teachers, parents, 
and children were not highly correlated, but that extensive reading is a 
significant factor in children’s social acceptability. 

Taves (99) compared a direct (Terman Marital Adjustment Scale) 
and an indirect (Kirkpatrick Family Interests Scale) approach to meas- 
uring marital adjustment. In two experiments he gave both scales to 508 
married people. He concluded that the indirect approach was preferable 
because it overcomes score variations resulting from differential motivation 
of respondents and, being more subtle, reduces hostility and conscious 
manipulation of responses. Psychometrically, there were no differences 
in reliability, validity, or ease and economy of administration and 
interpretation. 

Remmers and Shimberg (76) developed the SRA Youth Inventory, a 
checklist of 298 questions covering eight problem areas: school, future, 
career planning, self, getting along with others, home and family, boy 
meets girl, health, and things in general. Reported section reliabilities 
range from .75 to .94. Cheney (21) administered a problems questionnaire 
to 1560 high-school students. Areas of difficulty are selecting a 
vocation, training for a vocation, marriage, and getting along with people. 
Cheney’s findings show that many of the most troublesome youth prob- 
lems are discussed only superficially, if at all. Mooney and Price (58, 
59, 60) published three problem checklists, one for each of the college, 
high-school, and junior high-school levels. Price, Bender, and Mooney 
(72) produced another form for rural youth, and Price, Morison, and 
Mooney (73) still another for nursing students. Each of these forms 
averages about 300 items. Hunter and Morgan (40) used a Personal 
Interview Form of 78 items to investigate major problem areas of male 
and female college students. 

Coleman and McCalley (24) investigated nail biting by means of a 
questionnaire to 1000 college students. Fifty-two percent of male and 
54 percent of female students had bitten nails at some time. A sample 
of 54 present nail-biters and 54 who had never bitten nails was given 
the Bernreuter Personality Inventory and a personal data sheet covering 
childhood feelings. Female nail-biters were found to be more introverted 
and to have more current anxiety than nonbiters. Male nail-biters felt 
more anxiety and more often experienced inconsistent discipline and 
lack of independence in childhood. 

Mangus (52) compared the adjustment of third- and sixth-grade rural 
and urban children on the California Test of Personality. He tested 1229 
children. A higher average level of adjustment was found for farm chil- 
dren. Using the same test, administered to 1058 children in eight orphan- 
ages and 207 children in public schools, Edmiston and Baird (28) found 
(a) that the average adjustment of orphanage children was lower than that 
of public-school children, (b) that adjustment tended to deteriorate with 
duration of stay in orphanage, especially after eight years. Powell (70) 
made a correlational analysis of scores on the Bell Adjustment Inventory. 
She found health adjustment independent of home, social and emotional 
adjustment, while social adjustment, home and emotional adjustment 
were found to be substantially intercorrelated. 


Studies of Old People 


The following studies of old people represent interesting methodological 
points. Birren and Fox (11), comparing the two approaches systematically, 
found that in interviewing old people it is preferable to ask for date of 
birth rather than age. Altho birthdates were often more difficult to recall, 
they were more consistent. Havighurst (38) investigated problems of 
sampling and interviewing in studies of old people. He surveyed the 
sociological characteristics of all individuals over 65 years in a mid- 
western town of 6000, and then selected and called on a representative 
sample of 208. Successful interviews totaled 76 percent; refusals, 13 
percent; and “too ill,” 11 percent. Upper-class women and lower-middle- 
class men were most resistant to being interviewed. Interviewing was 
regarded as more effective than the mail questionnaire approach in 
obtaining information from the elderly. 

Cavan (19) studied family life and family substitutes in old age, 
using an inventory, “Your Activities and Attitudes,” constructed by the 
subcommittee on social adjustment in old age of the Social Science 
Research Council. This was administered to 498 men and 755 women 
over 60 years of age. The report analyzed factors associated with change 
from own home to some other—rooming house, boarding house, hotel, 
home of someone else, philanthropic home, or fee home—and described 
patterns of companionship, activities, and attitude characteristics of these 
living arrangements. 

Pressey and Simcoe (71), using a technic similar to Flanagan’s critical 
incident method, compared successful and problem old people. Students 
in a university course on adult life were given an outline and instructed 
to describe one old person of their acquaintance who made a satisfactory 
adjustment and one who was a personality problem. Results were analyzed 
for 349 successful and 204 problem old people. They indicated an advan- 
tage to the adjustment of the elderly in living in small towns, being 
socially active, gainfully occupied, and having diverse interests. 


CHAPTER VIII 


Tests as Research Instruments 


ROBERT L. THORNDIKE 


The content of this chapter is defined in large measure negatively by 
the limits of other chapters in this and other issues of the REVIEW OF 
EDUCATIONAL RESEARCH. The literature on statistical theory is assigned to 
Chapter V, while methodological and substantive studies of factor analysis 
are covered in Chapter IV. The only statistical studies that are reviewed 
here are those which are primarily contributions to testing methodology. 
All of the observational technics, including instruments for self-evaluation, 
are covered in Chapter VII, so studies of questionnaires, inventories, and 
rating scales appear there rather than here. 

Substantive material on achievement tests, intelligence tests, and aptitude 
measures has been reviewed in the REVIEW OF EDUCATIONAL RESEARCH 
for February 1950. References cited in that issue are not repeated 
here, and the tendency is to avoid the type of material covered in that 
number. What is left after these various exclusions appears to be a 
consideration of the general theoretical issues relating to test construction 
and test use. 


General Treatments 


The period in question has seen the publication of two general treatises 
on test construction and theory which will represent lasting contributions 
to the testing literature. The first of these is Gulliksen’s Theory of Mental 
Tests (48). Gulliksen undertook the presentation of a complete integrated 
picture of the rational and statistical theory underlying the analysis of a 
single test. No attempt was made to deal fully with multivariate analysis. 
The book brings together under one cover a great deal of material that 
will be useful to the student of test theory. The reviewer was particularly 
interested in the treatments of (a) the effects of heterogeneity and of 
curtailment on a correlated variable, (b) the statistical definition of 
parallel tests, (c) the statistics of speeded tests, and (d) the general logic 
of weighting subtests to yield a total score. 

The second important book is Educational Measurement (60), edited 
by Lindquist and written by 21 contributing authors. This volume is 
divided into three sections, dealing respectively with “The Functions of 
Measurement in Education,” “The Construction of Achievement Tests,” 
and “Measurement Theory.” Specific chapters will be mentioned in con- 
nection with specific topics. 

In addition to the above two sources, Thorndike (70) brought out a 
text dealing with the phases of test development and analysis for per- 
sonnel selection. The treatment is oriented around the use of tests for 
selecting and classifying military, civil service, or industrial personnel. 
Introductory texts on test preparation, designed primarily for the class- 
room teacher, have been prepared by Travers (72) and Micheels and 
Karnes (62). An extensive bibliography of selected references on test 
construction, mental test theory, and statistics has been prepared by 
Goheen and Kavruck (43). 


Preparation of Test Items 


Guidance in the preparation of test items has been provided by Ebel 
(60), while Davis (60) commented on editorial considerations in item 
writing. Flanagan (36) and Travers (73) have each protested against the 
purely empirical approach to item preparation and have urged the impor- 
tance of rational analysis and the formulation of definite hypotheses as 
the basis for item preparation. Travers contrasted the approach of the 
technician, who is only interested in empirical validity, with that of the 
scientist, who is interested in developing and testing hypotheses, and 
pleaded for more of the scientific approach in test construction. Flanagan 
indicated the importance of determining the critical requirements of any 
job or segment of education, analyzing the knowledge and skill required 
to succeed in those requirements, and relating each test item directly to 
some required knowledge or skill. 


Item Analysis 


Davis (60) gave a comprehensive discussion of the logic and procedures 
for item analysis, and indicated appropriate ways of using item analysis 
data. The literature is covered thru 1949 and is supplemented by the 
author’s own critical discussion of such problems as correction for chance, 
optimum difficulty distributions, and the appropriate use of item statistics 
in preparing different types of tests. 

A number of reports consider specific aspects of item difficulty. Cadwell 
(16) reported data which confirms previous indications that judges can 
estimate the relative difficulty of test items with fair success, but are not 
able to make accurate judgments of absolute difficulty level. The problem 
of using word frequency counts as indicators of difficulty of vocabulary 
test items received the attention of several writers (29, 56, 79). The rela- 
tionship appears to be very slight within the range of commoner words 
and when vocabulary knowledge is measured by testing precise choice of 
meaning. When relatively rare words are included and when only broad 
discriminations of meaning are required, a substantial relationship 
appears. There appear, thus, to be two somewhat distinct aspects of 
vocabulary involved—range and precision. 

The optimum shape of test score distribution was discussed by Ferguson 
(34). Ferguson indicated that when the function of a test is to make the 
maximum number of discriminations among individuals tested, the opti- 
mum shape of distribution is rectangular, and that this shape of distribu- 
tion can be approximated by selecting items within a narrow range of 
difficulties around the 50 percent level. There are, of course, as Davis (60) 
indicated, a number of other purposes for which a test may be used, for 
which quite different score distributions may be required. 

Mollenkopf (63) has investigated the effect of position in the test and 
speeded vs. unspeeded administration on indexes of item difficulty and 
discrimination. Both types of indexes were found to be disturbed for the 
later items of speeded tests. The effect was judged due to the distorting 
selection introduced in a speeded test, where those who complete the test 
tend to be the able on the one hand and the careless on the other. 

The relative precision of different indexes of item-test correlation has 
been compared by Doppelt and Potts (31) and by Flanagan (35). The 
results are in agreement in indicating that correlations estimated from the 
upper and lower 27 percent are only slightly less precise than biserial 
correlations, tho both of these show sampling fluctuations substantially 
larger than those to be expected for a product-moment correlation with 
the same size of group. 
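
The upper-and-lower-27-percent approach can be sketched as follows, assuming dichotomously scored items (1 = right, 0 = wrong) and the total score as the criterion. The function name and the toy data are hypothetical, and the index shown is the simple difference in proportions passing between the extreme groups, not the table-based correlation estimate compared in the studies cited.

```python
def upper_lower_index(item_scores, total_scores, fraction=0.27):
    """Difference in proportion passing an item between the top and
    bottom `fraction` of examinees ranked by total score."""
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must be the same length")
    n = len(total_scores)
    k = max(1, round(n * fraction))          # size of each extreme group
    # Rank examinees from lowest to highest total score (ties keep input order).
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten examinees, one item.
totals = [3, 5, 6, 8, 9, 11, 12, 14, 15, 18]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
index = upper_lower_index(item, totals)
```

Because only the two extreme groups enter the computation, the middle 46 percent of cases contribute nothing, which is one reason such an index is somewhat less precise than a coefficient using every case.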

Bedell (5) developed a routine for determining the number of items 
from a test to retain for maximum validity, while Gulliksen (46) and 
Gleser and DuBois (42) each proposed approximation procedures for 
selecting from a total group of items the subset which yields a score with 
maximum validity. All these procedures are concerned with maximizing 
the correlation for the specific sample. It is not clear for any of them that 
the correlation will still be a maximum in a new sample in which regres- 
sion effects change the relative size of correlations for different items. 

Walker (1) proposed applying the logic of sequential analysis to item 
analysis. In this application of sequential analysis, one examines item data 
for a small number of pairs of cases from the upper and lower extremes 
of the group and decides that the results (a) require rejection of the 
hypothesis of zero correlation between item and test, (b) are completely 
compatible with the hypothesis of no relation, or (c) do not permit a 
decision. In the last case, additional cases are added pair by pair until a 
decision is possible in one direction or the other. Where analysis was 
being done by hand, without benefit of IBM equipment, the sequential 
procedure could presumably result in a substantial time saving. However, 
the only decision which it permits is that an item’s correlation with a 
criterion score is or is not different from zero. This is rarely a useful piece 
of information when carrying out an item analysis, since practically every 
item will have a positive validity coefficient, and the decision which one 
must make is which are the most desirable items to use from among a 
group all of which have positive correlations with the total score. 
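
The three-way decision described above can be illustrated with Wald's sequential probability ratio test applied to discordant pairs, i.e., pairs in which exactly one member passes the item. This is an assumption made for illustration and is not Walker's published procedure: under no item-test relation the upper-group member is the passer in half of the discordant pairs, and the alternative value `p1`, the error rates `alpha` and `beta`, and the function name are all hypothetical.

```python
import math


def sequential_item_test(pairs, p1=0.75, alpha=0.05, beta=0.05):
    """Wald SPRT over (upper_passed, lower_passed) pairs.

    H0: among discordant pairs, P(upper member is the passer) = 0.5.
    H1: that probability equals p1 (> 0.5).
    Returns (decision, number_of_pairs_examined).
    """
    upper_bound = math.log((1 - beta) / alpha)   # crossing it rejects H0
    lower_bound = math.log(beta / (1 - alpha))   # crossing it accepts H0
    llr = 0.0                                    # running log likelihood ratio
    used = 0
    for used, (upper_passed, lower_passed) in enumerate(pairs, start=1):
        if upper_passed == lower_passed:
            continue                             # concordant pair adds no information
        if upper_passed:
            llr += math.log(p1 / 0.5)
        else:
            llr += math.log((1 - p1) / 0.5)
        if llr >= upper_bound:
            return "reject zero correlation", used
        if llr <= lower_bound:
            return "compatible with no relation", used
    return "no decision yet", used


# Hypothetical run: the upper-group member passes in every pair examined.
decision, n_pairs = sequential_item_test([(1, 0)] * 10)
```

The sketch makes the limitation noted in the text concrete: the only outputs are a yes-or-no verdict on zero correlation, with no ranking of items by degree of discrimination.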

The sequential analysis type of thinking has also been discussed by 
Moonan (64) in connection with the use of tests and test scores and by 
Kimball (55) in connection with the checking of test scoring. In the first 
instance, the proposal is that those pupils whose performance on a sample 
of test material permits a decision that they surpass a required limit be 
exempted from further testing and spend their time in some other way 
(an idea related to the procedure that has occasionally been practiced in 
schools of exempting certain students with superior class records from 
final examinations). In the second case, the idea is to rescore a limited 
sample of a set of papers and to continue check-scoring until it is possible 
to state with specified confidence that the scoring does or does not meet 
specified limits of accuracy. In spite of the interest of these investigators, 
the present writer doubts whether sequential analysis has any very impor- 
tant contribution to make in the construction and use of tests. 

In using tests, weighting the part scores when they are combined is 
always a nuisance. Horst (49, 51) developed a technic for calculating 
what part of a specified total testing period (not including time for 
instructions and practice) should be allotted to each subtest to give a 
maximum prediction of a criterion. The solution applies, of course, to 
the present sample, and the lengths will not, in general, be those which 
will yield the most valid test in a new sample. 


Reliability and Homogeneity 


Problems of estimating the precision and singleness of meaning of the 
score resulting from a test continue to attract attention and arouse con- 
troversy. Thorndike (60) has reviewed the underlying logic, the experi- 
mental procedures for gathering data, the statistical procedures for 
computing indexes, and the uses and limitations of reliability data. Horst 
(50) has developed a generalized formula for estimating reliability which 
is applicable when the number of scores or ratings varies from person 
to person. 

A number of writers have been concerned with the concept of test 
homogeneity. Roughly speaking, a homogeneous test is one in which all 
the items are measuring the same trait or the same combination of traits. 
However, just what is to serve as an index of homogeneity is not clear. 
The extent to which a set of items will “scale,” as defined by Stouffer 
(68) and others (20), seems to be the criterion for some. However, low 
“scalability” may result from instability of response to single items as 
well as from heterogeneity of the items in a set. Carroll (18) considered 
other possible indicators of homogeneity. He suggested that the items of 
a test be sorted into groups for general difficulty level and the examinees 
be sorted into groups with respect to total test score. If a three-dimensional 
plot is now prepared in which one dimension is item difficulty and a 
second is total score level, and if in the third dimension we plot the per- 
cent succeeding with each item in each ability category, these percentages 
should fall on a smooth surface if the test is to be considered homogeneous. 

Gage and Damrin (38) investigated the properties of the type of homo- 
geneity index proposed by Loevinger. They found that it yielded numerical 
magnitudes completely different from the standard reliability coefficient, 
tho both are indexes with a maximum possible value of 1.00, and that 
the homogeneity index was unrelated to test length. Furthermore, the 
index appeared to be a function of item difficulty, tending to be a maxi- 
mum when the items are widely spaced in difficulty. Kriedt and Clark 
(57) compared the characteristics of a test composed of items selected 
by “scale analysis” with one based on traditional item analysis procedures. 
Scale analysis yielded a less reliable test and one with items which split 
the group in quite uneven fractions, tho the test did possess a higher 
degree of “scalability.” 

The concern with homogeneity may eventually prove fruitful, but it 
seems doubtful that procedures which have so far been developed to 
produce or express it are useful in the preparation of ability tests. How 
homogeneous a test should be depends, of course, on the purpose for 
which the scores are to be used. 


Gulliksen (47) and Cronbach and Warrington (23) have attacked the 
problem of getting from a single test administration a usable estimate 
of the reliability of a speeded test. The ordinary split-test or Kuder- 
Richardson procedure has a spurious element which tends to make it an 
overestimate. These authors developed formulas to indicate a lower bound 
for the reliability coefficient. It was pointed out that when the split-test 
reliability and the lower-bound estimate differ only slightly, the reliability 
can be bracketed within useful limits. When the upper and lower bounds 
differ widely, no useful estimate can be obtained, and one must fall back 
on the separate administration of parallel forms of the test. 

Clark (19) reported an empirical analysis of the effect of different 
item splits upon the split-half reliability coefficient. Using intelligence 
test material, he concluded that the particular item split is insignificant 
as a source of variation. 
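
A minimal sketch of how the effect of the item split might be examined, assuming dichotomous item scores and the Spearman-Brown step-up of the half-test correlation. The data are simulated and the function names are hypothetical; this is neither Clark's procedure nor his data.

```python
import random


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(responses, half_a, half_b):
    """Correlate two half-test scores and step up with Spearman-Brown."""
    a = [sum(person[i] for i in half_a) for person in responses]
    b = [sum(person[i] for i in half_b) for person in responses]
    r_halves = pearson(a, b)
    return 2 * r_halves / (1 + r_halves)


# Simulated data: 200 examinees, 40 items; the chance of passing an item
# rises with a hypothetical ability value drawn for each examinee.
rng = random.Random(0)
n_items = 40
responses = []
for _ in range(200):
    ability = rng.random()
    responses.append([1 if rng.random() < 0.2 + 0.6 * ability else 0
                      for _ in range(n_items)])

# Odd-even split plus twenty random splits of the same items.
items = list(range(n_items))
estimates = [split_half_reliability(responses, items[0::2], items[1::2])]
for _ in range(20):
    rng.shuffle(items)
    estimates.append(split_half_reliability(responses, items[:20], items[20:]))
spread = max(estimates) - min(estimates)
```

The value of `spread` gives a rough indication of how much the coefficient depends on the particular split for material of this simulated kind.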

The reliability of difference scores which serve as the basis for differ- 
ential diagnosis and guidance has come in for empirical study and 
theoretical discussion. Doppelt and Bennett (30) have reported evidence 
on the reliability over a three-year period for differences between pairs 
of tests in the Differential Aptitude Test Battery. The difference scores 
have a reliability of about .50, as compared with one of .70 to .75 for the 
component scores. Derner, Aborn and Canter (28) and Gilhooley (40) 
reported reliability data for the subtests of the Wechsler, and questioned 
the extensive use made of differences between these only moderately 
reliable and rather substantially correlated subscores. 
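
The loss of reliability in a difference score follows from the standard formula for two equally variable scores, r_dd = ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy). The short sketch below applies it; the function name and the input values are hypothetical and are not taken from the studies cited.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y for two scores with equal variances.

    r_xx, r_yy: reliabilities of the two component scores.
    r_xy: correlation between the two component scores.
    """
    if r_xy >= 1:
        raise ValueError("r_xy must be below 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Hypothetical values: two tests each with reliability .75, correlated .50.
r_diff = difference_reliability(0.75, 0.75, 0.50)
```

The higher the correlation between the two components, the more of their reliable variance is shared and therefore cancelled in the difference.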


Validity and the Criterion 





A comprehensive treatment of the topic of validity was written by 
Cureton (60). More limited discussions of theoretical problems of validity 
were prepared by Anastasi (3) and by Gulliksen (45). Anastasi empha- 
sized that validity refers to some practical criterion rather than to some 
hypostatized trait, and that an extraneous factor, such as socio-economic 
status, is to be considered a distorter of test scores only if it affects the 
test scores without similarly affecting the criterion measure. Gulliksen 
questioned the desirability of unquestioningly accepting “expert judg- 
ment” in selecting a criterion variable. He proposed that alternate cri- 
terion measures be studied in terms of their intercorrelations and factor 
structure, and that the sensibleness of the correlations between criterion 
and predictors serve as one cue to the appropriateness of the criterion. 

There have been several articles oriented around producing more satis- 
factory practical indexes of the effectiveness of tests. Brogden (9) 
developed a coefficient of selective efficiency which under certain circum- 
stances is equivalent to a biserial or product-moment correlation coefficient, 
and which indicates what proportion of the maximum possible gain in 
criterion score results from selection at a specified level on a particular 
predictor. Jarrett (52) proposed the use of percent increase in output as 
a practical measure of the effectiveness of a selection procedure in those 
cases in which number of units produced is a meaningful criterion vari- 
able. Brogden and Taylor (11) developed the notion of “the dollar 
criterion” as a common yardstick for appraising personnel procedures— 
an appealing notion but one fraught with numerous practical difficulties. 
Brogden and Taylor (12) have also considered the various classes of 
deficiencies in criterion measures and discussed ways of combating them. 

Brokaw (13) tested empirically the theoretical relationship between 
the length of the subtests in a battery and the validities of the resulting 
composite scores. His results were in accord with expectation, in that 
cutting the length of the separate tests in half produced only a slight drop 
in the validity of the composite. 
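
Why halving test length costs so little validity can be seen, for a single test, from the classical relation between validity and length: when length is multiplied by n, validity is multiplied by sqrt(n / (1 + (n - 1) r_xx)), where r_xx is the original reliability. The sketch treats one test rather than Brokaw's composite, and the numerical values and function name are hypothetical.

```python
import math


def validity_after_length_change(validity, reliability, factor):
    """Expected validity when test length is multiplied by `factor`,
    given the original validity and reliability (classical test theory)."""
    return validity * math.sqrt(factor / (1 + (factor - 1) * reliability))


# Hypothetical test: validity .50, reliability .90, cut to half length.
halved = validity_after_length_change(0.50, 0.90, 0.5)
```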

Travers and Wallace (74) discussed the problem of inconsistent validity 
data in successive groups. The discussion was illustrated by data from 
two consecutive classes in a school of dentistry. Change in the subjective 
weighting of entrance criteria was proposed as a probable explanation. 


Cross Validation 


A set of papers discussed the importance of cross validation, i.e., the 
verification of test weights or item selection on a new sample. Mosier 
(65) distinguished among a number of related concepts which come 
within the general area of cross validation. Katzell (54) indicated the 
crucial role of cross validation when items are being selected by validity 
against an external criterion, and Cureton (26) documented this point by 
a synthetic example. Cureton (25) indicated that, especially where pre- 
dictors are highly correlated, much of the difference in regression weights 
between variables may be due to chance fluctuations, and proposed a 
principal components type of analysis to provide a closer approximation 
to “true score” regression weights. Wherry (80) tried to reconcile the 
advantages of theoretical analysis of regression weights and empirical 
tests by cross validation. 
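
The need for cross validation when items are selected against an external criterion can be illustrated with purely random data, in the spirit of a synthetic example. The sketch below is an illustration under stated assumptions, not a reconstruction of Cureton's example: the sample sizes, the number of items kept, and all function names are hypothetical.

```python
import random


def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def make_sample(rng, n_people, n_items):
    """Random 0/1 items and a random criterion: no true validity anywhere."""
    items = [[rng.randint(0, 1) for _ in range(n_items)] for _ in range(n_people)]
    criterion = [rng.gauss(0, 1) for _ in range(n_people)]
    return items, criterion


def score(items, key):
    """Total score on the selected items for each person."""
    return [sum(person[i] for i in key) for person in items]


rng = random.Random(1)
n_items, n_keep = 80, 10

# Select the ten "most valid" items in a small original sample.
items_a, crit_a = make_sample(rng, 30, n_items)
item_r = [pearson([p[i] for p in items_a], crit_a) for i in range(n_items)]
key = sorted(range(n_items), key=lambda i: item_r[i], reverse=True)[:n_keep]
validity_original = pearson(score(items_a, key), crit_a)

# Apply the same key, unchanged, to a fresh sample.
items_b, crit_b = make_sample(rng, 200, n_items)
validity_cross = pearson(score(items_b, key), crit_b)
```

Comparing `validity_original` with `validity_cross` shows how much of the apparent validity in the selection sample is capitalization on chance.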


Special Types of Scoring and Scaling 


A number of investigators have considered problems relating to the 
more effective scoring of tests or scaling of the resulting scores. Guilford 
and Michael (44) discussed the problem of developing relatively univocal 
scores of pure factors. They feel that since in many cases the purity of 
a test is contaminated by only a single additional factor, the use of a 
single suppressor variable will often be an effective procedure for obtain- 
ing a relatively pure score. Feifel and Lorge (33) studied the qualitative 
differences between vocabulary responses of children and adults, indi- 
cating the possible value of qualitative differentiations in scoring. 

Fruchter (37) studied the information to be used from working with 
error scores. In some speeded tests, number wrong and number right have 
quite low intercorrelations. Fruchter found that a number of error scores 
defined a distinct factor, which might perhaps be considered to be “care- 
fulness.” Glaser (41) criticized “inconsistency scores,” based on the 
number of changes in response from one testing to a later testing, as 
involving a basic statistical artifact. He points out that the number of 
changes which an individual makes will depend upon the number of 
items which are at or near the threshold of reaction for that individual. 

The problem of handling patterns of item responses was discussed by 
Meehl (61), who pointed out that it is possible for items to have joint 
validity for discriminating criterion groups even tho neither of them has 
validity individually. This will happen when the intercorrelation of the 
items is different in each of the criterion groups. Meehl indicates that 
empirical studies are under way to determine whether “configural scoring” 
appears profitable for items of the Minnesota Multiphasic Personality 
Inventory. Cronbach (21) developed a procedure which facilitates the 
joint tabulation of three subscores and the search for patterns of the three 
which discriminate criterion groups. Du Mas (32) developed an expres- 
sion for similarity between pairs of profiles based upon the percent of 
like-signed lines in the two profiles. However, the test of significance for 
this index is faulty, since it assumes independence in the signs of the 
different segments. Mosier (60) prepared a general critique of profiles 
as devices for representing test results. 
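
The point about joint validity without individual validity can be made concrete with a constructed example. In the hypothetical data below each item is passed by half of each criterion group, so neither item separates the groups alone, but the two items are perfectly positively related in one group and perfectly negatively related in the other. The scoring rule and names are assumptions for illustration only.

```python
# Hypothetical response patterns (item 1, item 2) for two criterion groups.
group_a = [(1, 1)] * 25 + [(0, 0)] * 25   # the two responses always agree
group_b = [(1, 0)] * 25 + [(0, 1)] * 25   # the two responses always differ


def pass_rate(group, item):
    """Proportion of the group passing the given item (index 0 or 1)."""
    return sum(person[item] for person in group) / len(group)


def configural_score(person):
    """Score 1 when the two responses agree, 0 when they differ."""
    return 1 if person[0] == person[1] else 0


# Gap between groups for each item taken singly, then for the pattern.
single_item_gaps = [pass_rate(group_a, i) - pass_rate(group_b, i) for i in (0, 1)]
pattern_gap = (sum(map(configural_score, group_a)) / len(group_a)
               - sum(map(configural_score, group_b)) / len(group_b))
```

Here the single-item gaps are zero while the pattern score separates the groups completely, an extreme case chosen to make the logic visible rather than one to be expected in practice.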

The use of special testing devices which give the student immediate 
knowledge of results and which permit him to try alternate responses 
until he reaches the correct one has been studied by Pressey (66) and 
by Jones and Sawyer (53). It is reported in each case that this type of 
testing procedure results in improved learning. 

A procedure for scaling test results which does not depend upon assump- 
tions of normality has been developed by Gardner (39). The routine 
makes use of tables of the Pearson Type III curve, and is applicable to 
distributions with any degree of skewness. Flanagan (60) prepared a 
general discussion of the problems of units and scaling in psychological 
tests. 


















December 1951 Tests AS RESEARCH INSTRUMENTS 





Distortions of Test Performance 


Cronbach (22) presented further evidence of the influence of “response sets,” 
i.e., such individual idiosyncrasies as readiness to guess, emphasis on speed as 
opposed to accuracy, preference for some particular response option, and 
the like, upon test performance. A number of suggestions are offered for 
minimizing these influences. Cross (24) confirmed previous studies which 
show that the self-report type of inventory is readily subject to faking. 


Differential Prediction 


The basic theoretical and practical problems in using a test battery for 
differential prediction of success in a number of fields, which were the 
focus of so much work during World War II, continue to attract interest. 
The problems were reviewed briefly by Wesman and Bennett (78). They 
questioned the desirability of concentrating on tests which differentiate 
two fields to the exclusion of those which measure qualities common to 
both. Thorndike (71) stated the problems involved both in designing a 
battery to be used in assigning men to many categories and in setting up 
administrative procedures for the use of the resulting scores. 

Brogden (10) investigated the gain in effectiveness of differential 
assignment resulting from the use of several different predictor variables 
rather than only a single predictor. Brogden’s analysis is limited to the 
specific situations where either (a) assignment is made between only two 
different categories, or (b) there is zero correlation between all the pre- 
dictor scores. The results from these special cases made him perhaps 
unduly optimistic as to the gains to be expected from differential predic- 
tion in the general case in which there are a number of assignment cate- 
gories and the predictor scores for all of the categories have substantial 
correlations. 


Tests in the Study of Communities 


Recent nationwide testing programs have made possible several studies 
in which the community rather than the person is the unit being studied. 
Davenport and Remmers (27) related state means on the World War II 
A-12 V-12 test to various socio-economic facts about the states, finding 
correlations as high as .83. Thorndike (1) studied community variables 
related to intelligence and achievement test scores of school children 
tested in the standardization of the revised Metropolitan Achievement 
Tests. He found the community variables to give fairly substantial predic- 
tions of the children’s intelligence test score, but only slight prediction 
of the achievement measures. Lennon (59) studied the interrelations of 
community intelligence and achievement measures using the same source 
of data. 


New Tests 


An important addition to the stock of individual intelligence tests for 
children is the Wechsler Intelligence Scale for Children (77). This test 
is made up of the same subtests and yields the same types of scores as 
the well-known scale for adults. The Leiter International Performance 
Scale, 1948 Revision (58) provides an individual performance test for 
adults. A nonreading vocabulary test, requiring subjects to indicate to 
which one of a set of pictures a particular word relates, has been developed 
by Ammons and Ammons (2). 

In the testing of motor performance, Van der Lugt brought to this 
country and translated for use a series for the testing of manual ability 
for adults (75) and a psychomotor test series for children (76). Both 
were originally developed in Europe, with norms and other statistics on 
European groups. 

The measurement of listening comprehension has been an area of 
marked activity, and reports of tests and their use have been prepared 
by Blewett (7), Brown (14), and Spache (67). Typically, these tests 
seem to have fairly adequate reliability and low enough correlations both 
with measures of reading and with measures of general intelligence to 
indicate that they are measuring skills which are in some degree unique. 
Increased dependence on newer audio-visual media of instruction will 
make measures of ability to learn by looking and listening as important 
as measures of ability to learn by reading, as Caffrey (17) pointed out in his plea 
for “auding-age” norms. 

Two variations of the Thematic Apperception Test have been prepared 
with the objective of adapting the materials to use with specific groups. 
Thompson (69) prepared a form incorporating Negro figures for use 
with Negro subjects, while Bachrach and Thompson (4) developed a 
form showing figures of crippled individuals for use with the handicapped. 
Bellak and Bellak (6) prepared a test for children which makes use of 
animal figures. The Blacky Pictures, prepared by Blum (8), have been 
published for research use. This series again makes use of a dog (Blacky) 
as the central figure in a series of pictures designed to permit sex-related 
interpretations. 

An inventory of interests and activities has been prepared by Burgess, 
Cavan, and Havighurst (15) for the specific purpose of work with adults 
60 and over. 


Concluding Statement 


In the work of the past three years, the reviewer has not identified 
any instances of especially noteworthy advance in test invention or in 
test theory. There has been real progress in consolidating and 
organizing test doctrine for the user, and certain of the new tests 
will undoubtedly give good service. 


Handicapped children, 78, 93; school-plant 
facilities for, 19 

Health education, 199 

Heating, 40; needed research on, 67 

Historical research, 340 

Historiography, 340 

Home economics, classrooms, 21 

Homogeneity, of tests, 462 

Human relations, teaching of, 204 


Illumination, 20, 36 

Industrial arts, classrooms, 22 

Inservice teacher education, arithmetic, 
319 

Instructional materials, 266; elementary 
science, 280; geometry, 309; graphic, 
286; laboratory, 286; mathematics, 296; 
pamphlets, 283; science, 279, 281 

Insurance, fire, 60; of school plants, 60 

Intelligence, growth of, 76 

Intercultural education, 79, 140, 143 

Interest inventories, 120 

Interest rates on bonds, 54 

Interests, 79 

Interview technics, 116, 450 

Item analysis, 460 


Laboratory experiences, criteria for, 286; 
relation to content of textbooks, 288 

Laboratory methods, 287 

Leadership, 142; in discussion groups, 
143 

Learning, 186; and the curriculum, 186, 
188; exceptional children, 189; theory, 
188 

Learning materials, 220; content, 221; 
readability and difficulty of, 220 

Legal control of education, 181 

Legislation for building construction, 6 

Libraries, curriculum, 213 

Library resources, 338 

Life adjustment education, 202 

Lighting, 20; and decorating, 37; natural, 
37; needed research on, 67; principles, 
36 

Literacy, 149 


Maintenance, needed research on, 68; of 
school plants, 59 

Materials, school-plant, 28, 33, 53 

Mathematics, 290, 305; achievement, 297; 
aims and purposes, 305; college, 305; 
contents of courses, 306; counseling in 
secondary program, 310; curriculum, 
305; elementary school, 290; in second- 
ary schools, 305; in junior high school, 
290, 298; methods, 291; predicting suc- 
cess, 297, 310; remedial procedures, 
310; teacher education, 312 

Measurement, see Evaluation 

Mental development, 76 

Mental hygiene and guidance, 188 

Mentally retarded, 96 

Methods, computational, 432; factor anal- 
ysis, 377; inductive vs. deductive, 287; 
in special education, 97; meaning ap- 
proach, 291; science, 264 

Methods of teaching, 211; arithmetic, 
191, 212; exceptional children, 211; 
films on, 214; geometry, 191, 213; lan- 
guage skills, 212; reading, 213; mathe- 
matics, 291, 308 

Methods research, 360 

Migration of workers, 152 

Motion pictures, 284; see also Audio- 
visual aids 

Motor development, 75 

Multivariate analysis, 413 

Music classrooms, 23 


Needed research, classroom planning, 
65; college general education science 
courses, 275; curriculum, 233; factor 
analysis, 390; finance, 65; heating, 67; 
learning materials, 225; lighting, 67; 
mathematics, 290; school plant, 64; 
school plant operation, 68; sites, 66; 
survey and trend technics, 354; teacher 
education, 318; ventilation, 67 


Objectives and goals of the curriculum, 
173, 179, 186; of mathematics teaching, 
305; of science teaching, 249 

Observational research, 365 

Observational technics, 441 

Occupations, census statistics, 152; 
trends, 155 

Operation, needed research on, 68; school 
plant, 57 

Opinion research, 348, 449 

Organization of the curriculum, 196 

Pamphlets, 282 

Personality, 76; factor analysis of, 389; 
inventories, 117; measurement of, 115; 
and social status, 116 

Personnel, custodial, 57 

Personnel records, 86; selection, 124 

Philosophy, 173 

Physical education, school-plant facilities, 
24 

Physical growth and development, 75 

Physical handicaps, 93 

Planning, classrooms, 17; coordination, 
12; needed research in, 65; organiza- 
tion, 11; personnel, 11; procedures, 
10; school-plant design, 29; use of 
committees in, 11; use of consultants 
in, 12 

Plant, see School plant 

Prediction of achievement, 121 

Prediction, differential, 466 

Preservice teacher education, arithmetic, 
319 

Probability theory, 400 

Problem-solving, in arithmetic, 296; in 
mathematics, 300; in science, 270 

Projective technics, 115, 118 

Promotion policies, 92 

Psychodrama, 142 

Psychotherapy, 132, 140 

Punched-card technics, 431 


Questionnaires, 450 


Radio, 214 

Readiness, in arithmetic, 294 

Reading difficulty, pamphlets, 282; sci- 
ence textbooks, 282 

Recordings, 214 

Records, personnel, 86 

Reliability, of tests, 462 

Religious education, 176, 197; church 
and state, 177; place in secular educa- 
tion, 176 

Remedial instruction, in arithmetic, 294, 
310 

Research, hypotheses for, 186; involving 
correlation studies, 372; involving fac- 
toral designs, 367; involving identifica- 
tion of factors, 369; involving interre- 
lated classifications, 366; involving 
matching, 362, 368; involving obser- 
vational procedures, 441; opinion, 449; 
responsibilities for, 182 

Resource units, 209 

Resources, financial, 47 

Role-playing, 142 


Safety, in building construction, 28 
Salary schedules, for custodians, 58 
Sampling theory, 411 
Sanitary facilities, 39 

School plant, facilities for special 
education, 19; federal aid, 6, 49; finance, 
44; future needs, 5; heating and venti- 
lation, 40; lighting, 36; maintenance, 
59; materials, 28, 33, 53; modernization 
of, 61; multiple use of, 19; needed re- 
search, 64; operation, 57; planning, 5, 
10; plumbing and sanitation, 39, 59; 
programs, 5; rehabilitation, 60; space 
requirements, 17; state aid, 6; status, 5 

Science, 249, 264; classrooms, 24; col- 
lege, 256; curriculum, 251; difficulties 
in teaching, 279; elementary school, 
268; enrolments, 252; instructional ma- 
terials, 279; methods, 264; secondary 
school, 252; teacher education, 275; 
textbooks, 271, 281 

Science teaching, 264, 279; aims and pur- 
poses, 249; and attitude change, 259, 
269 

Scientific method, and critical thinking, 
270; in textbooks, 271; nature of, 250, 
271; teaching procedures, 271 

Self-surveys, 351 

Sensory aids, in mathematics, 296; in sci- 
ence teaching, 280 

Sites, 29; needed research on, 66 

Social adjustment, 78 

Social aspects of arithmetic, 293 

Social learning, 191 

Social status and behavior, 79; measure- 
ment of, 120; and personality, 116; 
and social acceptance, 116 

Sociodrama, 142 

Sociometrics, 446 

Sociometry, 117, 142 

Space requirements, 17 

Special education, 78, 94; methods and 
materials, 97; school-plant facilities 
for, 19 

State aid for school plants, 6, 49 

Statistical inference, 401 

Statistical theory, 398 

Student participation in curriculum plan- 
ning, 182 

Supervision and curriculum development, 
228; nature of, 228; organization for, 
229; practices and procedures, 231; 
technics, 231 

Surveys, technics, 346, 449 


Tables, mathematical, 430 

Teacher education, arithmetic, 317; inservice, 
319; mathematics, 312; preservice, 319; 
recommendations, 314, 318; science, 275 

Teaching aids, business-sponsored mate- 
rials, 283; current materials, 266 

Teaching materials, 209; evaluation of, 
182, 210; for visually handicapped, 
213; sources of, 210, 214 

Teaching technics, 264, 290, 305; biology, 
265; chemistry, 264; college science, 
256, 266, 274; enrichment, 267; film 
vs. demonstration, 267; mathematics, 
290, 308; scientific method, 271; semi- 
micro- vs. macro-methods, 277 

Technics, appraisal of trend and survey, 
353; case history, 444; computational, 
424; controlled diary, 444; interview, 
450; observational, 441; punched card, 
431; questionnaire, 450; rating, 446; 
sociometric, 446; survey and opinion 
research, 449 

Television, 214 

Test construction, 390 

Tests, aptitude, 115, 120; distortions of 
performance, 465; homogeneity, 462; 
of interests, 120; new, 467; personal- 
ity, 115; reliability of, 462; as research 
instruments, 459; scoring and scaling, 
465; of statistical hypotheses, 401; 
validity of, 463 

Textbooks, 220; arithmetic, 223, 294; 
evaluation of, 182; geography, 223; 
junior high-school science, 281; psy- 
chology, 222; readability and difficulty 
of, 220; readers, 221; science, 222; 
scientific method in, 271; statistics, 
398; use of, 224 

Theory, of statistical decision, 408; of 
statistical estimation, 405; probability, 
400; sampling, 411; statistical, 398 

Toilet facilities, 18, 39 

Trends, studies of, 350; technics of study- 
ing, 346 

Truancy, 91 

T-tests, errors in application of, 372 


Validity, of tests, 463 

Values, 175 

Ventilation, 40; needed research on, 67 
Veterans, guidance, 110 

Visual aids, see Audio-visual aids 
Vocabulary studies, in science, 281 
Vocational education, 151; selection, 122 
Vocational guidance, of handicapped, 97 


Work experience, 201 
Work standards, for custodians, 58 